
Ever since I first learned about it, I have been convinced that maximum likelihood estimation is a very nice method for estimating the parameter of a probability distribution. However, when I was thinking about the intuition behind Fisher information, I ran into the question What kind of information is Fisher information? and found myself a little confused. It says we can think of Fisher information as a way to measure the curvature of the log-likelihood, if we write it as
$$
I(\theta^{\star})=-\left[\mathbb{E}_{X\sim\theta^{\star}}\frac{\partial^2}{\partial\theta^2}\log f(X;\theta)\right]\Bigg|_{\theta=\theta^{\star}}.
$$
Under some regularity conditions, we may interchange integration and differentiation to get
$$
I(\theta^{\star})=-\left[\frac{\partial^2}{\partial\theta^2}\mathbb{E}_{X\sim\theta^{\star}}\log f(X;\theta)\right]\Bigg|_{\theta=\theta^{\star}}.
$$
Then we can interpret $I(\theta^{\star})$ as the curvature of the expected log-likelihood around $\theta^{\star}$ when the data actually follow the distribution with parameter $\theta^{\star}$.

So here comes my question: is the expected log-likelihood $\mathbb{E}_{X\sim\theta^{\star}}\log f(X;\theta)$ (a function of $\theta$) maximized at $\theta=\theta^{\star}$? If not, I think that would weaken the "reasonableness" of MLE. Is this just a needless worry, or have I misunderstood something about Fisher information?
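One quick way to see what is at stake is to compute the expected log-likelihood numerically for a simple model. The sketch below assumes a Bernoulli model with a made-up true value $\theta^{\star} = 0.3$ (purely for illustration) and evaluates $\mathbb{E}_{X\sim\theta^{\star}}\log f(X;\theta)$ on a grid of $\theta$:

theta.star = 0.3                      # assumed true parameter (illustration only)
theta = seq(0.01, 0.99, by = 0.001)   # grid of candidate parameter values
# E_{X ~ theta*}[log f(X; theta)] for a Bernoulli model:
exp.loglik = theta.star*log(theta) + (1 - theta.star)*log(1 - theta)
theta[which.max(exp.loglik)]          # maximizer of the expected log-likelihood
[1] 0.3

The grid maximizer lands on the assumed true value, which is what the question is asking about in general.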

  • See also https://stats.stackexchange.com/questions/92097/why-maximum-likelihood-and-not-expected-likelihood/449782#449782 – kjetil b halvorsen Sep 06 '20 at 13:18
  • Yes, the expected log-likelihood is maximised at the true value $\theta^*$. This is why the Kullback-Leibler divergence is always non-negative. – Xi'an Sep 06 '20 at 15:05 (a sketch of this argument follows these comments)
  • Furthermore, under regularity conditions, the expected score$$\mathbb{E}_{X\sim\theta^{\star}}\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]$$is also equal to $0$ when $\theta=\theta^*$. – Xi'an Sep 06 '20 at 15:09
  • @Xi'an, I suppose the answer I'm looking for is exactly what you say about Kullback-Leibler divergence! Besides, I appreciate all those insights provided by other comments and answers. – Chris Cloud Sep 07 '20 at 02:33
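Spelling out the Kullback-Leibler argument from the comments above: for any $\theta$,
$$
\mathbb{E}_{X\sim\theta^{\star}}\log f(X;\theta)-\mathbb{E}_{X\sim\theta^{\star}}\log f(X;\theta^{\star})
=\mathbb{E}_{X\sim\theta^{\star}}\log\frac{f(X;\theta)}{f(X;\theta^{\star})}
=-D_{\mathrm{KL}}\big(f(\cdot;\theta^{\star})\,\|\,f(\cdot;\theta)\big)\le 0,
$$
since the Kullback-Leibler divergence is non-negative (Jensen's inequality applied to the concave $\log$). Hence, provided the expectations exist, the expected log-likelihood is indeed maximized at $\theta=\theta^{\star}$.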

1 Answer


As an illustration, consider trying to estimate a binomial proportion $p$ from a known number $n$ of Bernoulli trials, of which $x$ turn out to be successes. First, use a small $n = 10,$ so that the MLE $\hat p = x/n$ may not be very accurate.

The likelihood function is the PDF considered as a function of $p,$ for the observed data. Let's use R to plot the likelihood function for $x = 4$:

x = 4;  n = 10;  p=seq(.1, .9, by = 0.001)
like = dbinom(x, n, p)
plot(p, like, type="l", lwd=2)
abline(h=0, col="green2")

mle = mean(p[like == max(like)])  #'mean' in case of ties with discrete p
mle
[1] 0.4
abline(v = mle, col="orange")

It seems clear that the likelihood function attains its maximum value at $\hat p = 4/10 = 0.4.$ Also, the curvature of the likelihood function at $\hat p$ is relatively gentle near its maximum, so we cannot expect the MLE to be extremely accurate.

[Figure: binomial likelihood function for x = 4, n = 10, with a vertical line at the MLE 0.4]
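As a rough sketch of how one might quantify that curvature: the observed information is the negative second derivative of the log-likelihood at the MLE, which for the binomial equals $n/(\hat p(1-\hat p)).$ The snippet below reuses x, n, and mle from the code above and checks a finite-difference approximation against that exact value:

loglik = function(p) dbinom(x, n, p, log = TRUE)  # log-likelihood as a function of p
h = 1e-4                                          # step size for the finite difference
obs.info = -(loglik(mle + h) - 2*loglik(mle) + loglik(mle - h))/h^2
obs.info                 # finite-difference observed information, about 41.7
n/(mle*(1 - mle))        # exact value: 10/(0.4*0.6) = 41.667
sqrt(1/obs.info)         # approximate standard error of the MLE, about 0.15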

Of course, values other than $x = 4$ can occur, and a Jeffreys 95% interval estimate of $p$ is $(0.153, 0.696)$ (under a Jeffreys prior the posterior is $\text{Beta}(x + 0.5,\, n - x + 0.5),$ here $\text{Beta}(4.5, 6.5)$).

qbeta(c(.025,.975), 4.5, 6.5)
[1] 0.1530671 0.6963205

By contrast, if $n = 100,$ then the maximum is much more precisely determined.

x = 42;  n = 100;  p=seq(.1, .9, by = 0.001)
like = dbinom(x, n, p)
plot(p, like, type="l", lwd=2)
abline(h=0, col="green2")

mle = mean(p[like == max(like)])  #'mean' in case of ties with discrete p
mle
[1] 0.42
abline(v = mle, col="orange")

[Figure: binomial likelihood function for x = 42, n = 100, with a vertical line at the MLE 0.42]

Here the Jeffreys interval estimate is $(0.317, 0.508).$

qbeta(c(.025,.975), 41.5, 59.5)
[1] 0.3172977 0.5078283

For the best estimation, we need the curvature of the likelihood curve at the MLE to be, on average, as sharp as possible.
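As a rough numerical summary of that comparison, the binomial Fisher information $n/(p(1-p))$ evaluated at the two MLEs above shows how much more sharply curved the $n = 100$ likelihood is:

10/(0.40*(1 - 0.40))      # curvature at the MLE for x = 4,  n = 10:  about 41.7
100/(0.42*(1 - 0.42))     # curvature at the MLE for x = 42, n = 100: about 410.5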

BruceET