
In an MLE setting with probability density function $f(X; \theta)$, the (expected) Fisher information is usually defined as the covariance matrix of the Fisher score, i.e. $$ I(\theta) = E_\theta \left( \frac{\partial \log f(X; \theta)}{\partial \theta} \frac{\partial \log f(X; \theta)}{\partial \theta^T}\right). $$ Under the right regularity conditions, this is equivalent to $$ I(\theta) = -E_{\theta}\left(\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2} \right). $$

However, the observed Fisher information is always given as $$ J(\theta) = -\frac{\partial^2 \log f(x; \theta)}{\partial \theta^2}. $$

Why is this the case? Why not instead consider $$ \tilde{J}(\theta) = \frac{\partial \log f(x; \theta)}{\partial \theta} \frac{\partial \log f(x; \theta)}{\partial \theta^T}\,? $$
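To make the three objects concrete, here is a small numpy sketch of my own (an exponential model with rate $\theta$, purely as a toy example) that evaluates $J$ and $\tilde J$ at the MLE and compares them with $I$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
n = 10_000
x = rng.exponential(scale=1 / theta_true, size=n)  # f(x; theta) = theta * exp(-theta * x)

theta_hat = 1 / x.mean()           # MLE of the rate

score = 1 / theta_hat - x          # per-observation score d/dtheta log f

J = n / theta_hat**2          # observed information: -d^2/dtheta^2 of the log-likelihood
J_tilde = np.sum(score**2)    # "gradient" version: sum of squared per-observation scores
I = n / theta_hat**2          # expected information of the sample (here it equals J,
                              # because the Hessian does not depend on the data for this model)

print(J / n, J_tilde / n, I / n)   # all close to 1 / theta_true**2 = 0.25
```

In this toy example both $J(\hat\theta)/n$ and $\tilde J(\hat\theta)/n$ end up close to the per-observation expected information $1/\theta^2$, which is what prompts the questions below.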

This answer and this one say the observed Fisher information is a consistent estimator of the expected Fisher information.

This leads me to the question summarized in the title, specifically:

  • Why is the observed information always defined via the (negative) Hessian (analogous to the second definition of the expected Fisher information above) and not via the outer product of the gradient (as in the first definition)?
  • Is $\tilde{J}$ also a consistent estimator of $I$?
  • Why and in what sense is $J$ 'better' than $\tilde{J}$ in practice, e.g. as a basis for constructing confidence intervals? (See the example below.)
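For the last point, what I have in mind is the usual Wald-type interval built from the observed information, $$ \hat\theta_j \pm z_{1-\alpha/2} \, \sqrt{\left[J(\hat\theta)^{-1}\right]_{jj}}, $$ versus the analogous interval with $\tilde J(\hat\theta)$ plugged in instead.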

Edit: I've discovered that $\tilde{J}$ is sometimes called the empirical Fisher information (McLachlan and Krishnan, 1997, Section 4.3). Still, I haven't found reasoning as to why this is inferior to $J$.

flhp

1 Answer


I find the MLE literature a bit fuzzy with nomenclature here, so I might have some details off; I will try to stick to the nomenclature you introduced.

We have the observed Fisher information:

$$\left[\mathcal {J}(\theta)\right]_{ij} = -\left(\frac{\partial^2 \log f}{\partial \theta_i \partial \theta_j}\right)$$

And the empirical Fisher information:

$$\left[\mathcal {\tilde J}(\theta)\right]_{ij} = \left(\frac{\partial \log f}{\partial \theta_i}\right)\left(\frac{\partial \log f}{\partial \theta_j}\right)$$

And it can be shown that, under regularity conditions (essentially differentiability plus being allowed to differentiate under the integral sign; see https://stats.stackexchange.com/a/101530/60613):

$$\left[\mathcal I(\theta)\right]_{ij} = E\left[\left[\mathcal J(\theta)\right]_{ij}\right] = E\left[\left[\mathcal {\tilde J}(\theta)\right]_{ij}\right]$$
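For reference, the identity follows by differentiating $\int f(x;\theta)\,dx = 1$ twice in $\theta$, using the regularity conditions to move the derivatives inside the integral:

$$ 0 = \frac{\partial^2}{\partial \theta_i \partial \theta_j} \int f \,dx = \int \left( \frac{\partial^2 \log f}{\partial \theta_i \partial \theta_j} + \frac{\partial \log f}{\partial \theta_i}\,\frac{\partial \log f}{\partial \theta_j} \right) f \,dx = -E\left[\left[\mathcal J(\theta)\right]_{ij}\right] + E\left[\left[\mathcal {\tilde J}(\theta)\right]_{ij}\right]. $$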

So, why not use $\mathcal {\tilde J}$ instead of $\mathcal J$? Well, we actually use both.

The distinction is that using $\tilde{\mathcal J}$ (the sum of outer products of per-observation scores) for MLE gives a Fisher-scoring / IWLS-type update, while using $\mathcal {J}$ (the negative observed Hessian) gives Newton-Raphson. $\tilde {\mathcal J}$ is positive semidefinite by construction, and for non-overparametrized log-likelihoods it is positive definite (with more observations than parameters, the score outer products span the whole parameter space; see Why is the Fisher Information matrix positive semidefinite?), and the optimization procedure benefits from that. ${\mathcal J}$ enjoys no such guarantee: away from the maximum it need not be positive definite.
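To make the positive-definiteness point concrete, here is a small numpy sketch of my own (a Cauchy location model, chosen because there the Hessian genuinely depends on the data): far from the optimum $\mathcal J$ can be negative, while $\tilde{\mathcal J}$, being a sum of squares, never is.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(size=200)   # Cauchy(theta, 1) sample with true theta = 0

def info_terms(theta, x):
    """Score U, observed information J, and empirical information J_tilde for the
    Cauchy location log-likelihood sum_i [-log(pi) - log(1 + (x_i - theta)^2)]."""
    r = x - theta
    s = 2 * r / (1 + r**2)                          # per-observation scores
    U = s.sum()                                     # total score
    J = np.sum((2 - 2 * r**2) / (1 + r**2) ** 2)    # minus the Hessian of the log-likelihood
    J_tilde = np.sum(s**2)                          # sum of squared scores
    return U, J, J_tilde

print(info_terms(5.0, x))           # far from the optimum: J is typically negative here
print(info_terms(np.median(x), x))  # near the MLE (median as a crude proxy): J and J_tilde
                                    # are both positive, each roughly n * I_1 = 200 * 0.5
```

A Newton-Raphson step $\theta \leftarrow \theta + U/\mathcal J$ can therefore point downhill when started far from the maximum, whereas a step built from $\tilde{\mathcal J}$ (or from the expected information) always moves in an ascent direction.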

If we are performing MLE on the canonical parameter of a distribution in the exponential family, then the Hessian of the log-likelihood does not depend on the data, so $\mathcal J(\theta) = \mathcal I(\theta)$ and the two procedures coincide.
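For instance, for a Poisson model written in canonical form, $\log f(x;\theta) = x\theta - e^{\theta} - \log x!$, the second derivative is $-e^{\theta}$ no matter what $x$ is, so $\mathcal J(\theta) = \mathcal I(\theta) = e^{\theta}$ and a Newton-Raphson step is exactly a Fisher-scoring step.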

Firebug
  • What a useful answer, thanks a lot :) I'd like to add a link to a great post that explains why $E\left[\left[\mathcal J(\theta)\right]_{ij}\right] = E\left[\left[\mathcal {\tilde J}(\theta)\right]_{ij}\right]$: http://mark.reid.name/blog/fisher-information-and-log-likelihood.html – Javier TG Feb 27 '22 at 00:47