To be clear, I doubt I am using the term "cross-validation" correctly here; what I am suggesting also seems similar to "bootstrapping" and "hyperparameter tuning". Terminology is not my strength.
Let's say we have a data set with $20$ observations, $D_1, \dots, D_{20}$. We don't know what prior to use for the data set, so we decide to use the maximum entropy prior given the population mean and variance, i.e. a normal prior. (This of course assumes the population distribution has finite second moment. I am not convinced that this assumption is innocuous, but it is common.)
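For reference, the maximum entropy fact being invoked here is that, among all densities on $\mathbb{R}$ with a given mean $\mu$ and variance $\sigma^2$, the normal density maximizes differential entropy: $$\mathscr{N}(\mu, \sigma^2) = \arg\max_{p \,:\, \mathbb{E}_p[X] = \mu,\; \mathrm{Var}_p(X) = \sigma^2} \Big( -\int_{\mathbb{R}} p(x) \log p(x) \, dx \Big) \,.$$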
But of course we don't know the population mean and population variance, so we estimate them. We can't use all of the data to estimate them, because then there would be no data left to do inference on. So let's say we use observations $D_1, \dots, D_{15}$ to get an estimate $\hat{\mu}$ of the population mean $\mu$ and an estimate $\hat{\sigma}^2$ of the population variance $\sigma^2$. We then take $\mathscr{N}(\hat{\mu}, \hat{\sigma}^2)$ as our prior and use the remaining $5$ observations $D_{16}, \dots, D_{20}$ to do inference with it.
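Here is a minimal sketch of that split in Python, just to pin down what I mean. The sample mean/variance estimators and the conjugate normal likelihood (known variance) are assumptions I am making for the sketch; the question above deliberately leaves both unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=20)   # placeholder data, 20 observations

# "Hyperparameter" step: estimate the prior's mean and variance from D_1..D_15
# (here simply the sample mean and sample variance).
train, holdout = D[:15], D[15:]
mu_hat = train.mean()
sigma2_hat = train.var(ddof=1)

# Inference step: use N(mu_hat, sigma2_hat) as the prior for the mean of the
# remaining 5 observations, assuming a normal likelihood with known variance
# sigma2_hat so the posterior has a closed form (conjugate normal-normal update).
n = len(holdout)
post_var = 1.0 / (1.0 / sigma2_hat + n / sigma2_hat)
post_mean = post_var * (mu_hat / sigma2_hat + holdout.sum() / sigma2_hat)
print(post_mean, post_var)
```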
No one would be happy with this situation, because we are no longer using all of our data for inference. So:
Question: In this situation, would it make sense to:
1. Calculate four priors $\mathscr{N}(\hat{\mu}_1, \hat{\sigma}_1^2)$, $\mathscr{N}(\hat{\mu}_2, \hat{\sigma}_2^2)$, $\mathscr{N}(\hat{\mu}_3, \hat{\sigma}_3^2)$, $\mathscr{N}(\hat{\mu}_4, \hat{\sigma}_4^2)$: the first using exactly the procedure above (with $D_{16}, \dots, D_{20}$ as the "holdout set"), the second using the analogous procedure with $D_{11}, \dots, D_{15}$ as the holdout set, the third with $D_{6}, \dots, D_{10}$, and the fourth with $D_1, \dots, D_5$; and then
2. Take as our prior either (a) an equal-weight convex combination of the four priors above, which I suppose would be a Gaussian mixture, or (b) the single normal $\mathscr{N}(\tilde{\mu}, \tilde{\sigma}^2)$, where $$\tilde{\mu} := \frac{1}{4}(\hat{\mu}_1 + \hat{\mu}_2 + \hat{\mu}_3 + \hat{\mu}_4 ) \,, \quad \tilde{\sigma}^2 := \frac{1}{4}(\hat{\sigma}_1^2 + \hat{\sigma}_2^2 + \hat{\sigma}_3^2 + \hat{\sigma}_4^2) \,?$$ (A sketch of both options is given below.)
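Here is a rough sketch of both options, again assuming the sample mean and sample variance as the estimators (my choice, not something fixed above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=20)   # placeholder data, 20 observations

# The four "holdout" blocks, in the order listed above.
folds = [np.arange(15, 20), np.arange(10, 15), np.arange(5, 10), np.arange(0, 5)]

# For each block, estimate (mu_k, sigma2_k) from the other 15 observations.
params = []
for holdout_idx in folds:
    train = np.delete(D, holdout_idx)
    params.append((train.mean(), train.var(ddof=1)))

# Option (a): equal-weight mixture of the four normal priors.
def mixture_prior_pdf(theta):
    return np.mean([stats.norm.pdf(theta, m, np.sqrt(v)) for m, v in params], axis=0)

# Option (b): a single normal prior with averaged hyperparameters.
mu_tilde = np.mean([m for m, _ in params])
sigma2_tilde = np.mean([v for _, v in params])
averaged_prior = stats.norm(mu_tilde, np.sqrt(sigma2_tilde))
```

Note that (a) and (b) are not the same distribution: whenever the $\hat{\mu}_k$ differ, the mixture in (a) is non-normal and more spread out than the averaged-parameter normal in (b).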
The above example generalizes readily, of course (for example, I didn't even specify a particular method for getting the estimates $\hat{\mu}$ and $\hat{\sigma}^2$), but I thought a concrete example would be clearer than trying to explain myself in full generality.
(Actually I even doubt that the above concrete example is explained clearly.)
Additional questions: Does something similar to the above procedure already have an established name? And is there any literature either showing that it lacks optimality properties or otherwise analyzing it theoretically?
I think this question is different from this related question, because that question has both an internal and an external source of data. In my example, the estimation of the prior's hyperparameters and the inference itself are "competing" for the same data, so we use "cross-validation" or repeated sub-sampling ("bootstrapping") to accomplish the required "hyperparameter tuning" for the prior.
It is also different from the method suggested here, which Andrew Gelman argued (probably convincingly; to be honest, I don't understand the argument) does not work well. But that method suggests an "M-estimation" approach with cross-validation to get the prior, i.e. selecting the "best-performing" prior from $\mathscr{N}(\hat{\mu}_1, \hat{\sigma}_1^2)$, $\mathscr{N}(\hat{\mu}_2, \hat{\sigma}_2^2)$, $\mathscr{N}(\hat{\mu}_3, \hat{\sigma}_3^2)$, $\mathscr{N}(\hat{\mu}_4, \hat{\sigma}_4^2)$, whereas I am suggesting some combination of them. Combining makes more sense to me than selecting, because it should both (a) avoid "overfitting" and (b) use more of the data to inform the choice of prior.
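To make the contrast concrete, here is a rough sketch of the difference as I understand it. The cross-validation score used below (log density of each holdout block under its fold's prior) is just a placeholder of my own; neither linked post pins down a particular score.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=20)   # placeholder data, 20 observations
folds = [np.arange(15, 20), np.arange(10, 15), np.arange(5, 10), np.arange(0, 5)]
params = [(np.delete(D, f).mean(), np.delete(D, f).var(ddof=1)) for f in folds]

# Placeholder score: log density each fold's prior assigns to its own holdout block.
scores = [stats.norm.logpdf(D[f], m, np.sqrt(v)).sum()
          for f, (m, v) in zip(folds, params)]

# "Select the best-performing prior" (how I read the other suggestion):
best_mu, best_sigma2 = params[int(np.argmax(scores))]

# "Combine all four" (what I am suggesting, here option (b) from above):
mu_tilde = np.mean([m for m, _ in params])
sigma2_tilde = np.mean([v for _, v in params])
```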