I have been reading James V. Stone's very nice books "Bayes' Rule" and "Information Theory". I want to figure out which sections of the books I did not understand and thus need to re-read. The following notes, which I wrote down, seem to contradict one another:
1. The MLE always corresponds to the uniform prior (the MAP under a uniform prior is the MLE; see the derivation after this list).
2. Sometimes a (proper) uniform prior is not possible (when the parameter lacks an upper or lower bound).
3. Non-Bayesian analysis, which uses the MLE instead of the MAP, essentially sidesteps or ignores the issue of modeling prior information and thus effectively assumes that there is none.
4. Non-informative (also called reference) priors correspond to maximizing the expected Kullback-Leibler divergence between posterior and prior, or equivalently the mutual information between the parameter $\theta$ and the random variable $X$ (see the formula after this list).
5. Sometimes the reference prior is not uniform; it can be a Jeffreys prior instead.
6. Bayesian inference always uses the MAP and non-Bayesian inference always uses the MLE.
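For reference, here is the derivation behind note 1 as I understand it (assuming the parameter space $\Theta$ is bounded so that the uniform prior $p(\theta) = c$ is proper):
$$
\hat{\theta}_{\text{MAP}} = \arg\max_{\theta \in \Theta} p(\theta \mid x) = \arg\max_{\theta \in \Theta} \frac{p(x \mid \theta)\, p(\theta)}{p(x)} = \arg\max_{\theta \in \Theta} p(x \mid \theta) = \hat{\theta}_{\text{MLE}},
$$
since $p(\theta)$ is constant on $\Theta$ and $p(x)$ does not depend on $\theta$.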
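Similarly, the formula I wrote down for note 4 (my notation, which may not match the books exactly) is that the reference prior maximizes the mutual information between $\theta$ and $X$, i.e. the expected KL divergence between posterior and prior:
$$
\pi^{*} = \arg\max_{\pi} I(\theta; X) = \arg\max_{\pi} \mathbb{E}_{X}\!\left[ D_{\mathrm{KL}}\big(p(\theta \mid X) \,\|\, \pi(\theta)\big) \right],
$$
whereas the Jeffreys prior mentioned in note 5 is $\pi(\theta) \propto \sqrt{\det \mathcal{I}(\theta)}$, where $\mathcal{I}(\theta)$ is the Fisher information.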
Question: Which of the above is wrong?
Even if non-Bayesian analysis does not reduce to "always use the MLE", is MLE estimation always a special case of Bayesian inference?
If so, under which circumstances is it a special case (uniform or reference priors)?
Based on the answers to questions [1][2][3][4] on CrossValidated, it seems that note 1 above is correct.
The consensus on a previous question I asked seems to be that non-Bayesian analysis cannot be reduced to a special case of Bayesian analysis. Therefore my guess is that note 6 above is incorrect.