
I have a particular problem where I am using Bayesian techniques to estimate the parameters of the distribution of a random variable.

I would like to use an external source of data to determine an appropriate prior distribution for the analysis (which will then be updated with internal data). Let's denote this external data as

$$\boldsymbol{X}=\{X_{1},X_{2},\ldots,X_{n}\}$$

Now, ideally I would like to use this data to determine the prior distribution of an underlying parameter vector

$$\boldsymbol{\Theta}=\{\theta_{1},\theta_{2}\}$$

My current approach is:

  • For $m = 1,\ldots,M$:

    1. Take a random subset of $\boldsymbol{X}$, denote this as $\boldsymbol{X}_{m}$.
    2. Estimate the parameters using $\boldsymbol{X}_{m}$, giving $\hat{\boldsymbol{\Theta}}_{m}$.

The above process yields $M$ estimates of $\boldsymbol{\Theta}$, which together form a prior distribution for our analysis.

I feel this is a reasonable approach, as it should be quite robust: we sample randomly from the original data each time we estimate. Obviously, the analysis will depend on the size of the subsample. I view this as a kind of cross-validation implicit in the estimation.
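For concreteness, here is a minimal sketch of the procedure above in Python. The lognormal model, the subsample size $k$, and the number of iterations $M$ are all hypothetical choices for illustration, not part of the question:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-in for the external data X (assumed lognormal purely for illustration).
X = stats.lognorm.rvs(s=0.5, scale=np.exp(1.0), size=5000, random_state=rng)

M, k = 1000, 500  # number of iterations and subsample size: both judgment calls

theta_hat = np.empty((M, 2))
for m in range(M):
    X_m = rng.choice(X, size=k, replace=False)        # step 1: random subset of X
    shape, _, scale = stats.lognorm.fit(X_m, floc=0)  # step 2: estimate by ML
    theta_hat[m] = np.log(scale), shape               # (theta_1, theta_2) = (mu, sigma)

# theta_hat now holds M estimates of Theta; the question proposes using their
# empirical spread as the prior distribution for (theta_1, theta_2).
```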

Is the above a reasonable approach to determining a prior distribution from an external source of data?

Additional: Essentially, I want to use the external data to construct prior distributions for each of the parameters. The external data contains information that the internal data lacks, and I would therefore like the priors to serve as the starting point for the Bayesian analysis on which the internal estimates will be based.

Just for clarification, the model has been set up as follows:

[figure: overall model]

where the iterative procedure is given below:

[figure: iterative procedure]

epp
  • I do not quite understand the added value of separating your external data into $M$ subsets to make $M$ point estimates, when you could use all of your external data (in one go) to form your prior distribution. – Zhubarb Apr 05 '18 at 08:01

3 Answers


Yes, it is an excellent method; priors are usually built from less information than this. It is a very defensible way to construct a prior density.

EDIT: See Bayesian Data Analysis, Third Edition, by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin.

See the sections on the use of historical data and on informative priors (both are listed in the index).

See also Introduction to Bayesian Statistics by William M. Bolstad and James M. Curran. See the chapter on robust Bayesian methods.

There is an article by Jaynes at http://bayes.wustl.edu/etj/articles/highly.informative.priors.pdf on highly informative priors.

There is also this article: Hamra, Ghassan, et al. "Integrating Informative Priors from Experimental Research with Bayesian Methods: An Example from Radiation Epidemiology." Epidemiology (Cambridge, Mass.) 24.1 (2013): 90–95.

EDIT: How interesting; I am working on a similar problem myself, assuming that by LGD you mean loss given default.

I do not know your methodology, but I presume it is some form of regression, so I would recommend looking at Bolstad's method of mixing prior results from the literature with a flat prior. This leaves the center of location intact but spreads out the uncertainty.

To provide an example, let us imagine the literature reports $\hat{\beta}_x=1.23$ with a variance of $.000001$, so it is estimated to a precision finer than your least significant digit. You could construct a prior distribution around the literature's center of location but with a high variance. How high is "high" will, of course, depend on the scaling of your variables, but you want it high enough that the dense region of your prior captures all reasonable estimates of the parameter.
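A sketch of that construction, where the normal form, the mixing weight, and the diffuse spread are all illustrative assumptions rather than prescribed values:

```python
import numpy as np
from scipy import stats

beta_hat, lit_sd = 1.23, 0.001   # literature estimate and its (tiny) std. dev.
diffuse_sd = 10.0                # wide enough to cover all reasonable estimates
w = 0.5                          # mixing weight on the literature component

def prior_pdf(beta):
    """Mixture prior: tight literature component plus a diffuse component,
    both centred at the literature's center of location."""
    return (w * stats.norm.pdf(beta, beta_hat, lit_sd)
            + (1 - w) * stats.norm.pdf(beta, beta_hat, diffuse_sd))

# Drawing from the mixture, e.g. to inspect the implied prior spread:
rng = np.random.default_rng(1)
use_tight = rng.random(10_000) < w
draws = np.where(use_tight,
                 rng.normal(beta_hat, lit_sd, 10_000),
                 rng.normal(beta_hat, diffuse_sd, 10_000))
```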

This is the difficulty of Bayesian methods: they depend on judgment to some extent. How to treat the joint distribution of the parameters will also be a judgment call. You could take the naive Bayes approach of assuming no covariance, but if there is a published covariance matrix then it should be the starting point of your search.

What you want your Bayesian method to do is guide your estimator into the region most likely to contain the true value. You should add variance to your estimators to account for the cultural and legal differences surrounding losses.

I would love to give you a clear answer that every journal editor would accept, but the most you can do is hunt the statistical literature for "informative prior" and "highly informative prior." There is also content on robust methods and mixture priors.

Gelman's section on the use of historical data would also be of use. He anticipates a body of literature from which to construct the prior, which is problematic since no one is ever performing exactly the same research in question.

One option is to adopt Cox's axiomatic approach, in which probability is grounded in logic. On that view you need to construct a logical framework for building the prior, so your real job is disclosure and reasonableness rather than a precise algebraic formulation.

Dave Harris
  • Thanks Dave. Do you know of any references or literature surrounding this type of approach? I'm trying to build a case for using it but was hoping for something a bit more formal or rigorous. – epp Apr 05 '18 at 01:38
  • I will look later tonight. Any information from outside the sample itself isn't in dispute as a source of prior construction. You may not take the implied density itself if there are differences that you believe exist between the outside data and your data, but, to provide an example, you certainly could use Canadian or British corporate bankruptcy rates as prior estimators of US rates if you also mixed them with a diffuse prior. In essence, you would keep the location but increase the spread to allow for cultural differences. – Dave Harris Apr 05 '18 at 01:41
  • Thanks again. I think my application is somewhat similar to the example you mention. I am modelling a given country's LGD and am using another country's experience as the prior belief. – epp Apr 05 '18 at 03:59

If you have $n$ samples, you can use Bayesian updating to update your prior sequentially: starting with a prior $\pi(\Theta)$, use Bayes' theorem to obtain the posterior

$$ \pi_1(\Theta) = \pi (\Theta | X_1) \propto p(X_1 | \Theta)\; \pi(\Theta) $$

To update your knowledge given the sample $X_2$, you take

$$ \pi_2(\Theta) = \pi (\Theta | X_1, X_2) \propto p(X_2 | X_1, \Theta)\; \pi_1(\Theta) $$

and so on, using the general rule

$$ \pi_{n+1}(\Theta) = \pi (\Theta | X_1, \dots, X_{n+1}) \propto p(X_{n+1} | X_1, \dots, X_n, \Theta)\; \pi_n(\Theta) $$

which is equivalent to updating all at once:

$$ \pi(\Theta|X_1,\dots,X_n) \propto p(X_1,\dots,X_n|\Theta) \; \pi(\Theta) $$

To give an example, consider a beta-binomial model with prior $\theta \sim \mathcal{B}(\alpha, \beta)$ on the success probability, where after observing $x$ successes in $n$ trials the posterior distribution is

$$ \theta \mid x \sim \mathcal{B}(\alpha + x, \,\beta + n-x) $$

so if you had two samples of sizes $n_1,n_2$ and observed $x_1,x_2$ successes, then first you would update the prior parameters $\alpha,\beta$ to $\alpha+x_1$ and $\beta+n_1-x_1$, and then, using the second sample, to $(\alpha+x_1)+x_2$ and $(\beta+n_1-x_1)+n_2-x_2$, which is equivalent to updating all at once: $\alpha + (x_1 + x_2)$ and $\beta + (n_1 + n_2) - (x_1 + x_2)$.
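A minimal numerical check of this equivalence (the prior hyperparameters and the two samples are arbitrary illustrative values):

```python
alpha, beta = 2.0, 2.0   # prior hyperparameters (illustrative)
n1, x1 = 50, 18          # first sample: trials and successes
n2, x2 = 30, 11          # second sample

# Sequential: prior -> posterior after sample 1 -> posterior after sample 2.
a_seq, b_seq = alpha + x1, beta + n1 - x1
a_seq, b_seq = a_seq + x2, b_seq + n2 - x2

# All at once.
a_all = alpha + (x1 + x2)
b_all = beta + (n1 + n2) - (x1 + x2)

assert (a_seq, b_seq) == (a_all, b_all)  # identical posteriors
```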

This directly translates to your case, where you have the "external" samples $X_1,\dots,X_n$ and want to use them to build a prior for analyzing the "internal" sample $X_{n+1}$. Basically, this is the general idea behind the Bayesian approach: you can use your initial, or previous, knowledge and include it in your model as a prior that is updated given new data.

It follows that

  • using the posterior obtained from $X_1,\dots,X_n$ to update given $X_{n+1}$ is a perfectly valid way to go,
  • if for some reason you need to proceed sequentially (e.g. the data arrive sequentially), then this is a valid way to go,
  • it doesn't make much sense to obtain $n$ independent "prior" estimates of $\Theta$ from the $n$ samples, since you can proceed all at once,
  • in fact, you shouldn't make $n$ independent estimates using the same prior in each case, since if you then somehow aggregated the results, the final result would include your prior $n$ times,
  • you shouldn't use the same data to "estimate" the prior (so $X_{n+1}$ really needs to be new data), since then the same information would be used twice and you would end up with an overconfident result (the point estimates wouldn't change, but the posterior distributions and interval estimates would be too narrow).

On the other hand, if what you are asking is whether using a frequentist approach on the external data to estimate the parameters of the priors is a valid way to go, then the answer is still yes: we often use external data, or previous results, to create informative priors. A Bayesian would argue, however, that there is no reason to use a frequentist approach here. You can go all the way with the Bayesian approach: start with some prior, update it using the external data, then use the posterior as the prior for a model of the "internal" data.
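A minimal sketch of that two-stage route, assuming for illustration a normal model with known observation variance (the data values and the vague-prior settings are hypothetical):

```python
import numpy as np

def normal_update(mu0, tau0_sq, x, sigma_sq):
    """Conjugate update of a N(mu0, tau0_sq) prior on the mean,
    given data x with known observation variance sigma_sq."""
    n = len(x)
    tau_sq = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
    mu = tau_sq * (mu0 / tau0_sq + np.sum(x) / sigma_sq)
    return mu, tau_sq

sigma_sq = 1.0                                   # assumed known
external = np.array([1.1, 0.8, 1.4, 0.9, 1.2])   # "external" data (hypothetical)
internal = np.array([0.6, 0.7])                  # "internal" data (hypothetical)

# Stage 1: vague prior updated with the external data.
mu1, tau1_sq = normal_update(0.0, 100.0, external, sigma_sq)
# Stage 2: that posterior serves as the prior for the internal data.
mu2, tau2_sq = normal_update(mu1, tau1_sq, internal, sigma_sq)
```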

Tim
  • Tim, I think I haven't been clear. I'll clarify my problem: I have data. I want to estimate the parameters of this data. Obviously, for a Bayesian approach I need to specify a prior for these parameters. I wanted to use *external* data to determine these priors. Thus, I create a prior distribution for these parameters by performing the sub-sampling routine outlined above. – epp Apr 05 '18 at 07:15
  • @StatsPlease what do you mean by "determining the priors"? I understand this as estimating the distributions of the priors, and this is what I described above. – Tim Apr 05 '18 at 07:17
  • My approach was to determine appropriate prior distributions for the parameters of interest that were based on external data. In a way, the prior distributions were going to be determined in a frequentist fashion with sub-sampling and parameter estimation within each iteration. Once I had these priors I would then perform a Bayesian analysis whereby I take my internal data (the data of interest) and update those priors to obtain my posterior. Are you saying this approach is invalid? – epp Apr 05 '18 at 07:35
  • @StatsPlease so you'd like to obtain something like empirical distributions for the priors using a bootstrap-like approach? How exactly would you pass those empirical distributions to your model and update them (it wouldn't be possible off the shelf with any Bayesian software)? Could you give a concrete, non-abstract example of the procedure you're talking about? Why exactly do you want to use the (approximate) empirical distribution instead of a functional prior distribution? – Tim Apr 05 '18 at 07:39
  • Yes, I was hoping to use the external data to construct a prior (the external data contains useful information that the internal data doesn't). Obviously, if I just fit a distribution to the external data I get a single pair of estimates of the parameters $(\theta_{1},\theta_{2})$. I want a *distribution* for each of these parameters. Thus, I thought performing the sub-sampling was a nice, robust way to get some variation around these parameters and, as a result, getting a distribution for each of the parameters. These distributions would then be used as the *priors* in the Bayesian analysis. – epp Apr 06 '18 at 01:50
  • Keep in mind the application is an economic/financial one. Thus, the sampling procedure represents an exploration of the question "*What if we hadn't observed certain idiosyncratic and systematic aspects of the economic cycle that influence the variable of interest*?" – epp Apr 06 '18 at 01:53
  • @StatsPlease Since this is a pretty nonstandard approach, please edit to give us more details on: (1) how exactly do you want to estimate the priors? (2) how do you want to combine the priors from multiple subsamples? (3) how do you want to feed the priors into the Bayesian model? (4) what exactly is the model? (5) how do you want to update the priors, since the "internal" data "doesn't" contain the needed information? This is crucial info to answer this. Why don't you want to use the standard Bayesian approach as described in both answers? – Tim Apr 06 '18 at 06:09
  • The external data and internal data are from different countries. Thus, I expect the parameters to move in a certain direction when updating with the internal data. I'm trying to use the external data as a reasonable starting point for the parameters. Given the relative size of the samples (external is much larger than internal), I feel that the priors will be too narrow from the first update with external and then shift very little using the internal data. By subsampling and fitting each time, I was constructing an empirical prior with more spread in the parameter space. – epp Apr 08 '18 at 23:43
  • (1) Iterate: subsample, fit MLE. (2) Each subsample contributes one $(\theta_{1},\theta_{2})$ pair to the single prior distribution i.e. the prior is constructed from all the pairs from the iterations. (3) In the standard way. (4) The model is quite simple, just updating a prior with data. I am using MCMC to estimate the posterior (MH within Gibbs). (5) So the internal data doesn't contain a recession, whereas the external does. I am trying to incorporate that stress into the internal data. – epp Apr 08 '18 at 23:51
  • (1) MLE is a single point, not a distribution, (2) I don't know what you mean by this notation, (3) what standard way?, (4) how exactly do you use the empirical (?) distribution here, (5) this is also still unclear. If you want to keep it that vague, then the above answer applies, since it also describes the standard way of dealing with such cases. – Tim Apr 09 '18 at 04:11
  • I've added some graphics in the question to (hopefully) clear up any confusion. – epp Apr 09 '18 at 04:14
  • @StatsPlease and what I described perfectly fits your diagram, so if it doesn't answer your question, I don't know what you're doing... What exactly does your "iterative procedure" return? Is it empirical distribution? Or parameter estimates for functional distribution? – Tim Apr 09 '18 at 05:12
  • The iterative procedure fits a functional distribution to the random sample, giving parameter estimates. After $n$ iterations, you have $n$ pairs of parameter estimates i.e. you have a distribution for the parameters (which is then used as the prior in what follows). – epp Apr 09 '18 at 05:51
  • @StatsPlease so say you have n=2 and fit a normal distribution, and you end up with two pairs (m=0.123423, s=1.23111) and (m=0.1111234, s=0.998787). Do you then say that the distribution of the parameter m is 0.123423 with probability 1/2, 0.1111234 with probability 1/2, and all other values with probability 0...? – Tim Apr 09 '18 at 06:00
  • Essentially, but you would run a large number of iterations, and then for each parameter $(m,s)$ you would be able to fit a nice, continuous distribution. This is then your prior $\pi_\Theta(\theta)$ for the analysis. – epp Apr 09 '18 at 06:53
  • @StatsPlease if the parameters are continuous, then the probability of seeing any of the values more than once is zero; moreover, the resulting distribution would be discontinuous. How exactly do you fit a continuous distribution to those parameter estimates? I guess you use KDE or something here? As I said, your description is very abstract and it is unclear what exact procedure you want to follow. If you want an equally abstract answer, then my answer applies to your description. – Tim Apr 09 '18 at 07:01
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/75752/discussion-between-statsplease-and-tim). – epp Apr 09 '18 at 23:56

This thread is somewhat old, but this is a problem that I have been giving some thought to and wanted to suggest an approach.

Let's say you have external data $X_1, \ldots, X_n$ and internal data $X_{n+1}, \ldots, X_m$, a likelihood function $p(X_1, \ldots, X_n \mid \theta)$ and a prior distribution $\pi(\theta \mid h)$. Your goal is to use the external data to estimate $h$, then update your prior using the internal data.

You could do this using an empirical Bayes type approach. That is, you first compute the marginal likelihood of the hyperparameters $h$ given the external data, and then estimate $h$ by maximum likelihood. $$p(X_1, \ldots, X_n \mid h) = \int p(X_1, \ldots, X_n \mid \theta)\,\pi(\theta \mid h)\, d\theta$$ $$\hat{h} = \arg\max_h \, p(X_1, \ldots, X_n \mid h)$$

You then proceed in the usual Bayesian way, updating the prior based on the internal data.

$$\pi(\theta \mid \hat{h}, X_{n+1}, \ldots, X_m) \propto p(X_{n+1}, \ldots, X_m \mid \theta)\, \pi(\theta \mid \hat{h})$$
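As a concrete illustration, here is a minimal sketch of this recipe for a beta-binomial model, where the marginal likelihood $p(X \mid h)$ has a closed form; the data values and optimizer choice are hypothetical:

```python
import numpy as np
from scipy import optimize, stats

n_ext = np.array([40, 55, 32, 60])  # external data: trials per group (hypothetical)
x_ext = np.array([12, 20,  9, 25])  # external data: successes per group

def neg_log_marginal(log_h):
    """-log p(x_ext | h) with theta integrated out analytically (beta-binomial
    marginal); parameterised on the log scale to keep (alpha, beta) positive."""
    a, b = np.exp(log_h)
    return -np.sum(stats.betabinom.logpmf(x_ext, n_ext, a, b))

# h-hat: maximise the marginal likelihood of the external data.
res = optimize.minimize(neg_log_marginal, x0=np.zeros(2), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)

# Usual Bayesian update of the empirical Bayes prior with the internal data.
n_int, x_int = 25, 7  # internal sample (hypothetical)
posterior = stats.beta(a_hat + x_int, b_hat + n_int - x_int)
```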

scurry