
In an MLE setting with probability density function $f(X; \theta)$, the (expected) Fisher information is usually defined as the covariance matrix of the Fisher score, i.e. $$ I(\theta) = E_\theta \left( \frac{\partial \log f(X; \theta)}{\partial \theta} \frac{\partial \log f(X; \theta)}{\partial \theta^T}\right). $$ Under the right regularity conditions, this is equivalent to $$ I(\theta) = -E_{\theta}\left(\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2} \right). $$

However, the observed Fisher information is always given as $$ J(\theta) = -\frac{\partial^2 \log f(x; \theta)}{\partial \theta^2}. $$

Why is this the case? Why not instead consider $$ \tilde{J}(\theta) = \frac{\partial \log f(x; \theta)}{\partial \theta} \frac{\partial \log f(x; \theta)}{\partial \theta^T}\,? $$
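To make the three objects concrete, here is a small numpy sketch of my own (an exponential model with rate $\theta$, purely as a toy example) that evaluates $J$ and $\tilde J$ at the MLE and compares them with $I$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.0
n = 10_000
x = rng.exponential(scale=1 / theta_true, size=n)  # f(x; theta) = theta * exp(-theta * x)

theta_hat = 1 / x.mean()           # MLE of the rate

score = 1 / theta_hat - x          # per-observation score d/dtheta log f

J = n / theta_hat**2          # observed information: -d^2/dtheta^2 of the log-likelihood
J_tilde = np.sum(score**2)    # "gradient" version: sum of squared per-observation scores
I = n / theta_hat**2          # expected information of the sample (here it equals J,
                              # because the Hessian does not depend on the data for this model)

print(J / n, J_tilde / n, I / n)   # all close to 1 / theta_true**2 = 0.25
```

In this toy example both $J(\hat\theta)/n$ and $\tilde J(\hat\theta)/n$ end up close to the per-observation expected information $1/\theta^2$, which is what prompts the questions below.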

This answer and this one say the observed Fisher information is a consistent estimator of the expected Fisher information.

This leads me to the question summarized in the title, specifically:

  • Why is the observed information always defined via the (negative) Hessian (analogous to the second definition of the expected Fisher information above) and not via the outer product of the gradient (as in the first definition)?
  • Is $\tilde{J}$ also a consistent estimator of $I$?
  • Why and in what sense is $J$ 'better' than $\tilde{J}$ in practice, e.g. as a basis for constructing confidence intervals? (See the example below.)
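For the last point, what I have in mind is the usual Wald-type interval built from the observed information, $$ \hat\theta_j \pm z_{1-\alpha/2} \, \sqrt{\left[J(\hat\theta)^{-1}\right]_{jj}}, $$ versus the analogous interval with $\tilde J(\hat\theta)$ plugged in instead.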

Edit: I've discovered that $\tilde{J}$ is sometimes called the empirical Fisher information (McLachlan and Krishnan, 1997, Section 4.3). Still, I haven't found reasoning as to why this is inferior to $J$.

flhp

1 Answer


I find the MLE literature a bit fuzzy with nomenclature here, so I might have some details off; I will try to stick to the nomenclature you introduced.

We have the observed Fisher information:

$$\left[\mathcal {J}(\theta)\right]_{ij} = -\left(\frac{\partial^2 \log f}{\partial \theta_i \partial \theta_j}\right)$$

And the empirical Fisher information:

$$\left[\mathcal {\tilde J}(\theta)\right]_{ij} = \left(\frac{\partial \log f}{\partial \theta_i}\right)\left(\frac{\partial \log f}{\partial \theta_j}\right)$$

And it can be shown that, under regularity conditions (essentially differentiability plus being allowed to differentiate under the integral sign; see https://stats.stackexchange.com/a/101530/60613):

$$\left[\mathcal I(\theta)\right]_{ij} = E\left[\left[\mathcal J(\theta)\right]_{ij}\right] = E\left[\left[\mathcal {\tilde J}(\theta)\right]_{ij}\right]$$
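For reference, the identity follows by differentiating $\int f(x;\theta)\,dx = 1$ twice in $\theta$, using the regularity conditions to move the derivatives inside the integral:

$$ 0 = \frac{\partial^2}{\partial \theta_i \partial \theta_j} \int f \,dx = \int \left( \frac{\partial^2 \log f}{\partial \theta_i \partial \theta_j} + \frac{\partial \log f}{\partial \theta_i}\,\frac{\partial \log f}{\partial \theta_j} \right) f \,dx = -E\left[\left[\mathcal J(\theta)\right]_{ij}\right] + E\left[\left[\mathcal {\tilde J}(\theta)\right]_{ij}\right]. $$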

So, why not use $\mathcal {\tilde J}$ instead of $\mathcal J$? Well, we actually use both.

The distinction is that using $\tilde{\mathcal J}$ (the sum of outer products of per-observation scores) for MLE gives a Fisher-scoring / IWLS-type update, while using $\mathcal {J}$ (the negative observed Hessian) gives Newton-Raphson. $\tilde {\mathcal J}$ is positive semidefinite by construction, and for non-overparametrized log-likelihoods it is positive definite (with more observations than parameters, the score outer products span the whole parameter space; see Why is the Fisher Information matrix positive semidefinite?), and the optimization procedure benefits from that. ${\mathcal J}$ enjoys no such guarantee: away from the maximum it need not be positive definite.
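To make the positive-definiteness point concrete, here is a small numpy sketch of my own (a Cauchy location model, chosen because there the Hessian genuinely depends on the data): far from the optimum $\mathcal J$ can be negative, while $\tilde{\mathcal J}$, being a sum of squares, never is.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(size=200)   # Cauchy(theta, 1) sample with true theta = 0

def info_terms(theta, x):
    """Score U, observed information J, and empirical information J_tilde for the
    Cauchy location log-likelihood sum_i [-log(pi) - log(1 + (x_i - theta)^2)]."""
    r = x - theta
    s = 2 * r / (1 + r**2)                          # per-observation scores
    U = s.sum()                                     # total score
    J = np.sum((2 - 2 * r**2) / (1 + r**2) ** 2)    # minus the Hessian of the log-likelihood
    J_tilde = np.sum(s**2)                          # sum of squared scores
    return U, J, J_tilde

print(info_terms(5.0, x))           # far from the optimum: J is typically negative here
print(info_terms(np.median(x), x))  # near the MLE (median as a crude proxy): J and J_tilde
                                    # are both positive, each roughly n * I_1 = 200 * 0.5
```

A Newton-Raphson step $\theta \leftarrow \theta + U/\mathcal J$ can therefore point downhill when started far from the maximum, whereas a step built from $\tilde{\mathcal J}$ (or from the expected information) always moves in an ascent direction.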

If we are performing MLE on the canonical parameter of a distribution in the exponential family, then the Hessian of the log-likelihood does not depend on the data, so $\mathcal J(\theta) = \mathcal I(\theta)$ and the two procedures coincide.
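For instance, for a Poisson model written in canonical form, $\log f(x;\theta) = x\theta - e^{\theta} - \log x!$, the second derivative is $-e^{\theta}$ no matter what $x$ is, so $\mathcal J(\theta) = \mathcal I(\theta) = e^{\theta}$ and a Newton-Raphson step is exactly a Fisher-scoring step.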

Firebug
  • What a useful answer, thanks a lot :) I'd like to add a link to a great post that explains why $E\left[\left[\mathcal J(\theta)\right]_{ij}\right] = E\left[\left[\mathcal {\tilde J}(\theta)\right]_{ij}\right]$: http://mark.reid.name/blog/fisher-information-and-log-likelihood.html – Javier TG Feb 27 '22 at 00:47