
In the book Uncertainty Quantification: Theory, Implementation, and Applications, by R. C. Smith, there is a chapter on Bayesian inference. The likelihood is Gaussian, with an error variance $\sigma^2$ that may be constant or random: $Y|q,\sigma^2\sim\prod_{i} \text{Normal}(f_i(q),\sigma^2)$. Here $Y$ (the response) and $q$ are random vectors, and the $f_i$ are deterministic functions. The prior distributions are $q\sim\pi(q)$ and $\sigma^2\sim\pi(\sigma^2)$. The author presents the Metropolis algorithm for simulating the posterior distribution $\pi(q,\sigma^2|d)$, where $d$ is the observed response data. If you do not have the book, that is not a problem, as the question is self-contained.

The Metropolis algorithm requires a proposal distribution. Although the Markov chain converges for any valid proposal distribution, the choice matters greatly for the speed of convergence. The author proposes a Gaussian proposal whose covariance matrix is the covariance matrix of the estimator from nonlinear regression (a frequentist method). That is, if $J(\theta'|\theta)$ is the proposal distribution, then $J(\theta'|\theta)=\text{Gaussian}(\theta,V)$, where $V=\hat{\sigma}^2(X^T(\hat{q})X(\hat{q}))^{-1}$. Here $\hat{q}$ is the maximum likelihood estimator (from the least-squares fit), $\hat{\sigma}^2$ is the estimated error variance from the nonlinear regression (the residual sum of squares divided by the number of observations minus the number of parameters), and $X$ is the Jacobian matrix of the map $\{f_i\}_i$ that defines the model (for linear regression, it would simply be the matrix of covariates). Hence, this method acts in the spirit of empirical Bayes: the data are employed beyond the likelihood. (A minimal sketch of this construction is given below.)
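To make the construction concrete, here is a minimal sketch in Python (not taken from the book): the model $f$, the synthetic data, the flat prior on $q$, and all variable names are illustrative assumptions, and $\sigma^2$ is held fixed at its least-squares estimate rather than sampled.

```python
# Sketch (not from the book): build the Gaussian proposal from a nonlinear
# least-squares fit and use it in a random-walk Metropolis sampler.
# The model f(q, x), the data (x, d) and the flat prior are illustrative choices.
import numpy as np
from scipy.optimize import least_squares

def f(q, x):
    # hypothetical two-parameter model: f_i(q) = q0 * exp(-q1 * x_i)
    return q[0] * np.exp(-q[1] * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
d = f([2.0, 0.7], x) + rng.normal(0, 0.1, x.size)       # synthetic data

# 1) Frequentist fit: q_hat, sigma2_hat and V = sigma2_hat * (X^T X)^(-1)
res = least_squares(lambda q: f(q, x) - d, x0=[1.0, 1.0])
q_hat, X = res.x, res.jac                                # X = Jacobian of {f_i} at q_hat
n, p = x.size, q_hat.size
sigma2_hat = np.sum(res.fun**2) / (n - p)                # estimated error variance
V = sigma2_hat * np.linalg.inv(X.T @ X)                  # proposal covariance

# 2) Random-walk Metropolis with proposal J(q'|q) = N(q, V), flat prior on q,
#    and the error variance fixed at sigma2_hat for simplicity.
def log_post(q):
    return -0.5 * np.sum((d - f(q, x))**2) / sigma2_hat

chain = np.empty((5000, p))
q, lp = q_hat.copy(), log_post(q_hat)
for t in range(chain.shape[0]):
    q_prop = rng.multivariate_normal(q, V)
    lp_prop = log_post(q_prop)
    if np.log(rng.uniform()) < lp_prop - lp:             # Metropolis accept/reject
        q, lp = q_prop, lp_prop
    chain[t] = q
```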

My question is whether, for complex models, there are other choices of the proposal distribution that work well. It seems that the shape of the proposal distribution should be as similar as possible to that of the unknown posterior distribution. Since the Bayesian estimator is asymptotically Gaussian, with covariance matrix estimated precisely by such a $V$, I wonder whether this empirical Bayes choice is "optimal" in this setting.

user269666
  • Empirical Bayes being a vague notion, where the hyperparameter estimate can be absolutely anything, it is hard to dream of an "optimal" choice. If nothing else, the choice of the parameterisation matters. – Xi'an Dec 31 '19 at 11:54
  • @Xi'an So let's forget about "optimality", as it is too strong a word. The book proposes using a proposal distribution estimated from the data (specifically, from a least-squares fit). Since the proposal distribution obtained in this manner has a shape similar to that of the posterior distribution, it seems a good option. I am asking whether there are other possibilities for the proposal distribution that do not rely on the data and still give good convergence for complex models. As far as I understand, the classical notion of Bayesian analysis says that the data should only enter through the likelihood. – user269666 Dec 31 '19 at 12:05

1 Answer


There is a major distinction between using the data to determine the prior (bad!) and using the data to determine the Metropolis proposal (good!): the former affects the inference (a double use of the data), while the latter only affects the simulation (a manageable approximation of the posterior). Hence, there is nothing wrong with using a Normal approximation to the posterior distribution as a proposal. In the ideal case where the posterior distribution itself is available for simulation, the Metropolis algorithm using this posterior as its proposal would be valid and would produce iid simulations from the posterior. (A sketch of such a sampler is given below.)
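To illustrate this point, here is a minimal sketch of an independence Metropolis-Hastings sampler in Python; the one-dimensional "posterior" and the parameter values are made up for illustration, and this is not an algorithm from the book. When the proposal coincides with the posterior, the Hastings ratio is identically 1, every draw is accepted, and the chain is an iid sample from the posterior.

```python
# Sketch: independence Metropolis-Hastings whose proposal is a fixed Normal
# approximation of the posterior. With the exact posterior as proposal the
# acceptance rate is 1 (iid sampling); with an approximation it is below 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
post = stats.norm(loc=2.0, scale=0.5)            # stand-in "posterior" (1-D toy)

def independence_mh(proposal, n_iter=10_000):
    x = proposal.rvs(random_state=rng)
    acc, chain = 0, np.empty(n_iter)
    for t in range(n_iter):
        y = proposal.rvs(random_state=rng)
        # Hastings ratio pi(y) q(x) / (pi(x) q(y)); the proposal density enters
        # because it does not depend on the current state.
        log_r = (post.logpdf(y) + proposal.logpdf(x)
                 - post.logpdf(x) - proposal.logpdf(y))
        if np.log(rng.uniform()) < log_r:
            x = y
            acc += 1
        chain[t] = x
    return chain, acc / n_iter

_, rate_exact = independence_mh(stats.norm(2.0, 0.5))   # proposal == posterior
_, rate_approx = independence_mh(stats.norm(1.8, 0.7))  # Normal approximation
print(rate_exact, rate_approx)                          # ~1.0 vs. < 1.0
```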

Xi'an
  • Thank you for your answer, I understand now. Just a small query: why is it bad to employ the data for the prior? If the prior distributions are centered at the maximum likelihood estimators, it seems that the Markov chain could converge faster. If the sample size is large, does using the data for the prior become less questionable? – user269666 Dec 31 '19 at 14:10
  • Just make this thought experiment: if the posterior is a better prior, iterate the process. What do you end up with? As to whether it matters, a highly interesting [paper by Petrone, Rousseau and Scricciolo (2012)](https://hal.archives-ouvertes.fr/hal-00767467/document) studies the connection between empirical Bayes "posteriors" and actual Bayes posteriors, and derives conditions under which they asymptotically agree (in the number of parameters n). – Xi'an Dec 31 '19 at 14:16
  • Let $\theta$ be the set of parameters and $L(\theta|y)$ be the likelihood function. If you iterate the posterior with different data sets $y_1,\ldots,y_m$, you obtain $$\pi(\theta|y_1,\ldots,y_m)\propto L(\theta|y_m)\pi(\theta|y_1,\ldots,y_{m-1})\propto\ldots\propto\left(\prod_{i=1}^m L(\theta|y_i)\right)\pi(\theta).$$That is, all the likelihoods get multiplied. I do not understand what this fact entails in practice. – user269666 Jan 01 '20 at 17:37
  • Ok. In such a case, if $y_1=\ldots=y_m=y$, then $$\pi(\theta|\underbrace{y,\ldots,y}_{m\text{ times}})\propto L(\theta|y)^m\pi(\theta).$$ The power $L(\theta|y)^m$ concentrates strongly around the maximum likelihood estimator given $y$. So, by employing the data in the prior distribution, we are pushing the posterior distribution towards the maximum likelihood estimator more strongly. But is this necessarily a problem? Is the asymptotic distribution of the Bayesian estimator (as the length of $y$ grows) not centered at the maximum likelihood estimator? – user269666 Jan 01 '20 at 18:52
  • The problem is that the spread of the posterior shrinks to nothing, and hence the posterior loses its ability to express uncertainty, which is its principal raison d'être (see the numerical sketch after these comments). – Xi'an Jan 01 '20 at 20:24
  • Thank you. And what about estimating prior distributions via the maximum entropy principle? Given a parameter $\theta$, the maximum entropy principle searches for the density function that maximizes $$S(\theta)=-\int_{-\infty}^\infty f_\theta(\theta)\log f_\theta(\theta)d\theta,$$ subject to constraints $E[\theta^k]=\int_{-\infty}^\infty \theta^k f_\theta(\theta)d\theta=f_k$, where $f_k$ are prespecified moments, $k\geq1$. If only information on $f_1$ is available and the support of $\theta$ is $[0,\infty)$, for instance, then $f_\theta$ is the density of an Exponential distribution... – user269666 Jan 02 '20 at 16:04
  • ... Thus, given a least-squares estimate of $\theta$ (using the data), this value may be regarded as the mean of $\theta$, i.e. $f_1$, and one can then take as prior distribution an Exponential with mean $f_1$ (rate $1/f_1$); a short derivation of the exponential form is sketched after these comments. – user269666 Jan 02 '20 at 16:04
  • There is no unicity in the maximum entropy principle as it depends on the moments being constrained. (Not to mention the dominating measure!) – Xi'an Jan 02 '20 at 16:24
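To illustrate the point about the collapsing spread, here is a small numerical sketch (not from the thread) in a conjugate Beta-Bernoulli model with made-up data: reusing the same data $m$ times amounts to raising the likelihood to the power $m$, and the resulting "posterior" standard deviation shrinks roughly like $1/\sqrt{m}$, so it stops reflecting the information actually contained in the data.

```python
# Sketch: reusing the data m times <=> likelihood raised to the power m.
# Beta(1,1) prior, s successes in n Bernoulli trials (hypothetical numbers);
# the "posterior" Beta(1 + m*s, 1 + m*(n - s)) concentrates as m grows.
from scipy import stats

n, s = 20, 7
for m in (1, 10, 100):
    a, b = 1 + m * s, 1 + m * (n - s)
    print(f"m = {m:3d}: posterior sd = {stats.beta(a, b).std():.4f}")
# Standard deviation shrinks roughly like 1/sqrt(m): ~0.10, ~0.03, ~0.01
```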
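For completeness, here is the standard derivation (not from the thread) of the exponential form mentioned in the maximum entropy comments, assuming support $[0,\infty)$, the Lebesgue dominating measure, and a single mean constraint $E[\theta]=f_1$:

$$\max_f\; -\int_0^\infty f(\theta)\log f(\theta)\,d\theta \quad\text{subject to}\quad \int_0^\infty f(\theta)\,d\theta=1,\qquad \int_0^\infty \theta\, f(\theta)\,d\theta=f_1.$$

Pointwise stationarity of the Lagrangian $-f\log f-\lambda_0 f-\lambda_1\theta f$ gives $-\log f(\theta)-1-\lambda_0-\lambda_1\theta=0$, i.e. $f(\theta)\propto e^{-\lambda_1\theta}$, and imposing the two constraints yields

$$f(\theta)=\frac{1}{f_1}\,e^{-\theta/f_1},$$

the Exponential density with mean $f_1$. Changing the constrained moments (or the dominating measure) changes the solution, which is the non-uniqueness Xi'an refers to above.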