
In the fully Bayesian approach, the predictive distribution is:

$$ P( Y|X ) = \int P(\theta | X ) P( Y | \theta ) d\theta $$

When the integral is difficult to compute, we might resort to the Maximum Likelihood approach, and approximate the predictive distribution as follows:

$$ P( Y|X ) \approx P( \hat \theta_{ML} | X ) P( Y | \hat \theta_{ML} ) $$

where $ \hat \theta_{ML} $ is the maximum likelihood estimate (MLE) of the parameter $\theta$.

When you look at this, the ML approach looks like an awfully bad approximation. Why does it nevertheless work reasonably well in practice?
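To make the comparison concrete, here is a small numeric sketch (my own illustration, not part of the original question) for a coin-flip model: the exact Bayesian predictive under a uniform Beta(1, 1) prior is Beta-Binomial, while the ML plug-in uses $\hat p = x/n$ in a plain Binomial. With even a moderate sample the two are already close:

```python
from math import comb, exp, lgamma

# Coin-flip model: x heads in n tosses observed; predict k heads in m new tosses.
# The full Bayesian predictive with a Beta(a, b) prior is Beta-Binomial; the
# ML plug-in approximation uses p_hat = x/n in a plain Binomial.

def lbeta(p, q):
    """Log of the Beta function B(p, q)."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)

def beta_binom_pred(k, m, x, n, a=1.0, b=1.0):
    """Exact posterior predictive P(k heads in m new tosses | x heads in n)."""
    return comb(m, k) * exp(lbeta(k + x + a, m - k + n - x + b) - lbeta(x + a, n - x + b))

def plugin_pred(k, m, x, n):
    """ML plug-in predictive: Binomial(m, p_hat) with p_hat = x/n."""
    p = x / n
    return comb(m, k) * p**k * (1 - p)**(m - k)

# With 7 heads in 10 tosses, the two predictive distributions for m = 2
# new tosses are already close:
for k in range(3):
    print(k, round(beta_binom_pred(k, 2, 7, 10), 4), round(plugin_pred(k, 2, 7, 10), 4))
```

The plug-in distribution is slightly too concentrated (it ignores the uncertainty in $\hat p$), but the discrepancy shrinks as $n$ grows.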

kjetil b halvorsen
usual me
    Note that "plugging in" MLEs doesn't take uncertainty in parameter estimates into account when making predictions. Profile predictive likelihoods would be a better illustration of non-Bayesian predictive inference. – Scortchi - Reinstate Monica Feb 24 '14 at 13:43
  • @OP Why do you think that ML looks like an awfully bad approximation? – meh Aug 01 '17 at 11:12
  • One extra thing – the actual approximation should be $P(Y|X)\approx P(Y|\hat{\theta})$. The other term is constant with respect to $Y$. – probabilityislogic Aug 01 '17 at 11:12
  • Perhaps it works for the same reason naive Bayes works- even though it is almost never statistically justified. – meh Aug 01 '17 at 11:13

2 Answers


You should give a reference for your claim that the approximation obtained by simply replacing $\theta$ by its maximum likelihood estimator $\hat{\theta}$ is good. That approximation will forget about the uncertainty in the estimation of $\theta$, and might be a good approximation in some cases and bad in others. That must be evaluated on a case-by-case basis. It will mostly be bad when there are few observations. A particular case where it is bad is a binomial likelihood with $\hat{p}=0$.
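A minimal sketch of that failure mode (the Beta(1, 1) prior on the Bayesian side is my own illustrative choice):

```python
# Five tosses, zero heads observed: the MLE is p_hat = 0, so the plug-in
# predictive assigns probability 0 to ever seeing a head again. A Beta(1, 1)
# prior instead gives Laplace's rule of succession.
n, heads = 5, 0

p_hat = heads / n                          # MLE = 0
plugin_next_head = p_hat                   # plug-in P(next toss is heads) = 0
bayes_next_head = (heads + 1) / (n + 2)    # posterior mean of Beta(1, 6) = 1/7

print(plugin_next_head, bayes_next_head)
```

The plug-in predictive is degenerate (it rules out heads entirely), while the Bayesian predictive still assigns heads a small but nonzero probability.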

One general approach to representing the uncertainty of estimation of $\theta$ is to use the Laplace approximation of the integral, an approach which should be better known. Start with the conditional density above in the form $$ \DeclareMathOperator*{\argmax}{arg\,max} f(y \mid x) = \int f(\theta \mid x) g(y \mid \theta) \; d\theta $$ (which assumes that $y$ and $x$ are conditionally independent given $\theta$). Write $u(y; \theta) = f(\theta \mid x) g(y \mid \theta)$ and $\theta_y =\argmax_\theta u(y; \theta)$, that is, the value of $\theta$ giving the maximum, as a function of $y$. Suppose also that the maximum is found by setting the derivative equal to zero. Then write the negative second derivative as $u_y''= -\frac{\partial^2}{\partial \theta^2} \log u(y;\theta)$ (evaluated at the maximum $\theta_y$), so that $u_y'' > 0$.

Using a Taylor expansion (and dropping the error term, to get an approximation) we have $$ \int f(\theta \mid x) g(y \mid \theta) \; d\theta = \\ \int \exp\left( \log u(y;\theta) \right)\; d\theta = \\ \int \exp\left( \log u(y;\theta_y)-\frac12 u_y'' (\theta - \theta_y)^2 +\cdots\right) \; d\theta \approx \\ u(y; \theta_y) \frac{\sqrt{2\pi}}{\sqrt{u_y''}} $$ (the first-order term vanishes because $\theta_y$ is a maximum). We can summarize this by saying that a better approximation to the (predictive) posterior of $Y$ is $$ p(y \mid x) \approx f(\theta_y \mid x) g(y \mid \theta_y) \frac{\sqrt{2\pi}}{\sqrt{u_y''}} $$ Note that $\theta_y$ can be seen as a maximum likelihood estimator using both $x$ and $y$ as data. This idea is related to the use of a profile likelihood function. (The above development assumes that $\theta$ is a scalar; the modification for a vector parameter is straightforward.) A paper-length treatment can be found here.
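As a sanity check, the Laplace formula can be compared against the exact integral in a case where both are available in closed form; the binomial likelihood with a flat prior used below is my own illustrative example, not from the answer:

```python
from math import exp, log, sqrt, pi, lgamma

# Laplace approximation of the normalising integral for a binomial
# likelihood with a flat prior: I = ∫ θ^x (1-θ)^(n-x) dθ = B(x+1, n-x+1).
x, n = 7, 10

def log_u(t):
    """Log of the integrand u(θ) = θ^x (1-θ)^(n-x)."""
    return x * log(t) + (n - x) * log(1 - t)

theta_star = x / n                   # mode of u
curv = n**3 / (x * (n - x))          # -d²/dθ² log u, evaluated at the mode
laplace = exp(log_u(theta_star)) * sqrt(2 * pi / curv)

# Exact value via the Beta function B(x+1, n-x+1)
exact = exp(lgamma(x + 1) + lgamma(n - x + 1) - lgamma(n + 2))

print(laplace, exact)   # agree to within a few percent even at n = 10
```

Even with only $n = 10$ observations the approximation is off by only a few percent, and the error shrinks as the integrand becomes more sharply peaked.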

kjetil b halvorsen

I would like to wax philosophical on @kjetil's answer, and specifically this statement:

That approximation will forget about the uncertainty in the estimation of $\theta$, and might be a good approximation in some cases and bad in others. That must be evaluated on a case-by-case basis.

The reason that we can use the MLE to good effect is because of the following two things:

  1. The real world is sane.
  2. We know that "extraordinary claims require extraordinary evidence" – Carl Sagan

The world is sane

What I mean by the first point is that if you make up an arbitrary problem, out of 'all possible problems' in some sense, then the MLE is likely to be a terrible estimate. However, if you choose a real problem out of the set of problems that one might legitimately encounter in the real world, then the MLE works reasonably well because $P(\theta)$ is not unreasonable.

To illustrate, consider that we would like to estimate $\theta$, the probability of heads of some coin-of-unknown-fairness. Now, in order to even compute the Bayesian version, before we can start computing probabilities with respect to a dataset $X$ we first need to contemplate the world of possible coins. This world of coins in which we found our coin is essentially $P(\theta)$, our prior probability.

Ordinarily, this world is easy to contemplate, because we would have a real world coin that we need to estimate, and we live in the real world. However, in a non-real world, who knows what manner of strange and magical coins there be? In a particular weird and magical world, we might have the following prior:

$$P(\theta) = \begin{cases} 0 & \theta \in A \\ 1/m(I - A) & \theta \in I - A \\ 0 & \text{else} \end{cases}$$

Where $m$ is the Lebesgue measure, $I$ is the unit interval, and $A$ is a set constructed with this clever method by Rudin.

We get some very strange behavior from this situation. Notably, there is an $m(A)$ chance that our MLE of $\theta$ is impossible. If we construct $A$ so that $m(A)$ is very close to 1, then the MLE of $\theta$ is almost certainly going to be bad in the sense that it will be impossible.

However, we don't live in this weird world. We live in the real world. Generally, when we pick up a coin, a prior for heads that is heavily weighted near $50\%$ is not unreasonable. At the very least, a continuous prior is almost certainly a good assumption. There is no mathematical necessity that our prior be continuous everywhere, or even anywhere, but we live in the real world, and the real world is a very special world out of the set of all mathematically feasible worlds. If $\theta_1$ is close to $\theta_2$ in the real world, then we anticipate that $\theta_1$ is nearly as likely as $\theta_2$ to be the correct proportion of heads. The fact that our world is a sane world is very convenient for scientists, who depend on this in order to estimate, e.g., the likelihood that some coin will turn up heads. In short, priors in our world tend to be well-behaved, and this constraint, along with the constraint discussed in the next section, means that the MLE is generally a likely value under our posterior distribution.
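A toy illustration of this point (the Beta(5, 5) prior here is my own illustrative choice): a smooth prior weighted near $50\%$ only mildly shrinks the MLE, and the posterior mode (MAP) $(x + a - 1)/(n + 2a - 2)$ merges with the MLE $x/n$ as data accumulate:

```python
# MAP estimate under a Beta(a, a) prior vs the MLE, as the sample grows.
a = 5  # Beta(5, 5): smooth prior weighted toward 0.5
for x, n in [(7, 10), (70, 100), (700, 1000)]:
    mle = x / n
    map_est = (x + a - 1) / (n + 2 * a - 2)   # mode of the Beta posterior
    print(n, mle, round(map_est, 4))
```

With a well-behaved prior like this, the two estimates never disagree wildly, and by $n = 1000$ they are essentially indistinguishable.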

Extraordinary claims require extraordinary evidence

To illustrate this, consider Fisher's tea-tasting lady. The lady claims that she has skill at determining whether the tea or the milk has been poured into the cup first. To test this, we design an experiment in which we randomize the order in which tea and milk are added to some cups of tea, and then we decide to use the percent difference between the fraction of times she was correct and 0.5 (random guessing) as the MLE of her relative skill at tea tasting. If we pour 5 cups of tea for her to taste, then we are guaranteed to measure at least a 20% tea-tasting skill (in magnitude), and it is not unlikely that we measure a 60% or even 100% tea-tasting skill.

However, we reflect briefly upon this experiment that we have designed, and it is clear that this is a terrible experiment. This is because we a priori judge this lady's claim to be nuts... there's just no way she can tell whether we poured the tea or the milk into the cup first. In other words, our prior is extremely skewed in this situation, so that our MLE is not very good in the sense that it is improbable given our prior.

As good scientists, however, we were not fooled by this, because we know that extraordinary claims require extraordinary evidence. If this lady really, really, for realsies can taste whether or not the tea was first, we need her to taste not only 5 cups, but 5000 cups! Of course, as the amount of evidence grows, the evidence overwhelms our skewed prior, and the MLE approaches the Bayesian estimate.
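A quick sketch of the evidence overwhelming a skeptical prior (the Beta(50, 50) prior and the 90% true skill are my own illustrative numbers):

```python
# Skeptical prior Beta(50, 50): heavily concentrated near p = 0.5
# ("she is just guessing"). Suppose she is in fact right 90% of the time.
a = b = 50
for n in [5, 50, 500, 5000]:
    correct = int(0.9 * n)                    # observed correct identifications
    mle = correct / n
    post_mean = (a + correct) / (a + b + n)   # mean of Beta(a + correct, b + n - correct)
    print(n, mle, round(post_mean, 3))
```

At 5 cups the posterior barely budges from 0.5 regardless of the MLE; by 5000 cups the posterior mean and the MLE have nearly converged.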

To sum up

In conclusion: since our world is sane, and since good scientists realize that extraordinary claims require extraordinary evidence, when we compute a maximum likelihood estimate (and are inclined to take it seriously), it is generally not far from the maximum a posteriori (MAP) estimate. This is because the priors for problems that we actually test are generally very boring: they are not extremely skewed, they are continuous and mostly differentiable, and they don't tend to conflict with reality to any large degree. Thus, our MLE is usually quite likely under our prior. If a value is likely under the prior, and the evidence also supports that value, then it will be very likely under the posterior. Thus, the MLE and the MAP estimate tend not to be very different in real-world problems. Of course, there is no guarantee that this is the case, but it is a convenient property of the sane world in which we live.

Him
  • The question was about the quality of a certain approximation, and your answer does not touch upon that at all. For a paper-length treatment of Laplace approximation in predictive likelihood, see https://www.jstor.org/stable/2336208?seq=1#metadata_info_tab_contents. Also https://scholar.google.com/scholar?oi=gsb40&q=predictive%20likelihood%20with%20laplace%20approximation&lookup=0&hl=en – kjetil b halvorsen Feb 13 '20 at 13:53
  • The second half of my answer relates to the size of the second derivative, assuming that it even exists. – Him Feb 13 '20 at 14:39
  • @kjetilbhalvorsen Sure it does. Your answer makes an assumption that the prior is differentiable (twice!). My answer addresses why this is a reasonable assumption. – Him Feb 13 '20 at 14:41
  • Then maybe you could substantiate your answer with some actual examples/calculations? – kjetil b halvorsen Feb 13 '20 at 14:41
  • @kjetilbhalvorsen, the same could be argued for your answer, that it should be substantiated with an explanation of what the symbolic calculations have to do with the real world. :) – Him Feb 13 '20 at 14:43
  • I have specified at the beginning that I only meant this as a philosophical addendum to your very good rigorous answer. – Him Feb 13 '20 at 14:44
  • You could have a look at https://stats.stackexchange.com/questions/304113/failure-of-maximum-likelihood-estimation – kjetil b halvorsen Feb 13 '20 at 15:47