
I'm trying to put confidence intervals on parameters fitted by MLE by inverting the observed Fisher information matrix. More specifically, I define the observed FIM as:

$$ J_{n}(\hat{\theta}_{n}) = -\sum^{n}_{i=1}\frac{\partial^{2}}{\partial\theta^{2}}\log f_{\theta}(X_{i})\bigg|_{\theta=\hat{\theta}_{n}} $$

where $\log f_{\theta}(X)$ is the log-likelihood contribution of a single observation. I've seen in [some sources][1] that the confidence interval for an individual parameter can be estimated as (Equation 3.15 in the source):

$$ \hat{\theta}_{nj} \pm c\sqrt{\left[J_{n}(\hat{\theta}_{n})^{-1}\right]_{jj}} $$
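In code I read Equation 3.15 as something like the sketch below, where `J_hat` is meant to be the observed information matrix from the definition above, summed over all observations and evaluated at the MLE (the function and variable names are just placeholders):

```python
import numpy as np
from scipy import stats

def wald_intervals(theta_hat, J_hat, alpha=0.05):
    """Per-parameter Wald intervals from the inverted observed information.

    theta_hat : MLE vector; J_hat : observed FIM J_n(theta_hat) (placeholder names).
    """
    cov = np.linalg.inv(J_hat)            # J_n(theta_hat)^{-1}
    se = np.sqrt(np.diag(cov))            # sqrt of the jj-th diagonal element
    c = stats.norm.ppf(1 - alpha / 2)     # critical value, e.g. 1.96 for 95%
    return theta_hat - c * se, theta_hat + c * se
```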

My confusion comes from accounting for the total number of data points. In some sources these quantities seem to be scaled by the number of data points, while in others the factor [seems to cancel out][2] (below Equation 2 on page 2). A definitive answer and source would be really helpful.

Edit:

So to put it in more explicit, applied terms (I might be conflating jargon throughout, so bear with me): we have an objective function like

$$ l_{\theta}=\sum_{i=1}^{n}\left(-\frac{1}{2}\ln(2\pi)-\frac{1}{2}\ln\sigma_{i}^{2}-\frac{1}{2}\left(\frac{\hat{y}_{i}(\theta)-y_{i}}{\sigma_{i}}\right)^{2}\right) $$

where $\theta$ is the vector of parameters being fitted, $n$ is the total number of data points in the set, $\sigma_{i}$ is the standard deviation of the noise around the experimental data, $\hat{y}_{i}(\theta)$ is the model-predicted outlet, and $y_{i}$ is the measured outlet.
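As a sketch of that objective in code (with `predict` standing in for whatever model produces $\hat{y}_{i}(\theta)$; the name is just a placeholder, not part of any real package):

```python
import numpy as np

def log_likelihood(theta, y, sigma, predict):
    """Gaussian log-likelihood l_theta for measurements y with noise sd sigma.

    `predict(theta)` is a placeholder for the model returning y_hat_i(theta);
    `sigma` may be a scalar or a per-point array.
    """
    y_hat = predict(theta)
    resid = (y_hat - y) / sigma
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * resid**2)
```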

My goal is to maximize a function like this (equivalently, minimize its negative) and then obtain a covariance matrix for the fitted parameters (which should be easy to convert into individual confidence intervals). Once I've reached the optimum, I can calculate the Hessian as:

$$ H_{\theta} = \frac{\partial^{2}l_{\theta}}{\partial\theta^{2}} $$
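In practice I'd approximate this Hessian with central finite differences, roughly as sketched below (the helper name and step size are just placeholders; a PDE-constrained model would need a more careful choice of step):

```python
import numpy as np

def hessian(f, theta, eps=1e-5):
    """Central-difference Hessian of a scalar function f at theta."""
    theta = np.asarray(theta, dtype=float)
    p = theta.size
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            ei = np.zeros(p)
            ej = np.zeros(p)
            ei[i] = eps
            ej[j] = eps
            H[i, j] = (f(theta + ei + ej) - f(theta + ei - ej)
                       - f(theta - ei + ej) + f(theta - ei - ej)) / (4 * eps**2)
    return H
```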

I've seen the covariance calculated through $H_{\theta{}}$ in two different ways: through the observed FIM and the expected FIM. If I were to choose the observed FIM, would it just be:

$$ V_{\theta} = H_{\theta}^{-1} $$

where $V_{\theta}$ is the covariance matrix. Would I be missing a scaling factor of $n$?

Then if I were to use the expected FIM, would I simply use:

$$ V_{\theta} = (E({H_{\theta}}))^{-1} $$

where $E$ is meant to be the expectation.

[1]: https://www.stat.umn.edu/geyer/s06/5102/notes/fish.pdf
[2]: https://arxiv.org/pdf/2107.04620.pdf

  • I suggest not labeling your objective function as $f$ since this notation is used for the probability density function. Your objective function appears to be the log likelihood, which you can label as $\ell$. In your edit I do not see a $\theta$ on the right hand side of the objective function. I also see you have two $y_i$'s, one of which I expected to be $\theta$, i.e. $(y_i-\theta)$. I also see you have an $i$ subscript on the $\sigma$ parameter. In this setting it would be best to not have these parameters be subject-specific. – Geoffrey Johnson Dec 14 '21 at 18:54
  • I updated my answer. – Geoffrey Johnson Dec 14 '21 at 19:18
  • Just edited the response again, where $\hat{y}_{i}(\theta)$ is meant to be the outlet of the model we're fitting w.r.t. the parameters $\theta$. So $y$ isn't the parameters we want to identify, but the output of the model we're trying to fit and quantify the uncertainty of $\theta$ for. – auf-wiedersehen-yall Dec 14 '21 at 19:23
  • You've lost me with "outlet of the model." I updated my response further with examples for the Poisson and normal distributions. – Geoffrey Johnson Dec 14 '21 at 19:37
  • I highly recommend the book *In All Likelihood* by Yudi Pawitan. Even his notation is not great, but it is still a phenomenal book. – Geoffrey Johnson Dec 14 '21 at 19:39
  • Typically we do not put a hat on $y$ since it is understood this is the result of random sampling. We also typically have $\theta$ explicitly in the model since it represents some feature of the data generative process such as the mean. – Geoffrey Johnson Dec 14 '21 at 19:41
  • I gave an explicit example using the Poisson distribution. – Geoffrey Johnson Dec 14 '21 at 20:00
  • Yeah, might not be able to explain it properly. I've got a system of PDEs describing a physical process I'm trying to fit parameters to. I've looked at this example frequently to try to understand it (https://stats.stackexchange.com/questions/68080/basic-question-about-fisher-information-matrix-and-relationship-to-hessian-and-s?rq=1), but still can't figure out the "number of data points" scaling question. I'll probably buy a copy of the book soon, but the answer should be available in some document online. – auf-wiedersehen-yall Dec 14 '21 at 20:04
  • Can you rely on a software package like glm in R or Proc Genmod in SAS to do the heavy lifting? – Geoffrey Johnson Dec 14 '21 at 20:07
  • I don't think so (not easily, anyway). Either way, I'll keep reading into it and playing around with it. There's a chance I'll ask a follow up in the next week :) Thanks for all the help! – auf-wiedersehen-yall Dec 14 '21 at 21:28
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/132337/discussion-between-geoffrey-johnson-and-auf-wiedersehen-yall). – Geoffrey Johnson Dec 14 '21 at 21:31

1 Answer


I find it easiest to view this as the inversion of a Wald test using the model-based standard error estimate. All of the data points must be used. Sometimes heavy notation with subscripts, or a lack of subscripts, can be confusing.

In order to discuss convergence in distribution the estimator needs to be multiplied by $\sqrt{n}$. The limiting distribution then has variance equal to the inverse Fisher information of a single observation from the data-generating process, i.e. $\sqrt{n}\{\hat{\theta}-\theta_0\}\overset{D}{\rightarrow}N(0,I^{-1}_1(\theta_0))$. This doesn't mean the sample size is irrelevant for the variance of the MLE; quite the contrary. It is equivalent to writing $\hat{\theta}\overset{approx}{\sim}N(\theta_0,I^{-1}_n(\theta_0))$ or $\frac{\hat{\theta}-\theta_0}{I^{-1/2}_n(\theta_0)}\overset{D}{\rightarrow}N(0,1)$.
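A quick simulation makes this concrete: the spread of the MLE shrinks like $I_n^{-1/2}(\theta_0)$ as $n$ grows. Here is a rough sketch (illustration only) using the sample mean of normal data with known $\sigma$, for which $I_n^{-1}(\theta_0)=\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, sigma = 2.0, 1.5

for n in (10, 100, 1000):
    # MLE of the mean is the sample average; repeat over many simulated datasets
    mles = rng.normal(theta0, sigma, size=(20000, n)).mean(axis=1)
    print(n, mles.std(), sigma / np.sqrt(n))  # empirical sd of the MLE vs. I_n^{-1/2}(theta0)
```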

This notation works if the model has a single scalar parameter $\theta$. In models with two or more parameters, collected in a vector $\boldsymbol{\theta}$, you might see notation like $I_{11}(\boldsymbol{\theta})$ referring to the 1,1 element of the Fisher information matrix $I_1(\boldsymbol{\theta})$ for a single datum. The same $I_{11}(\boldsymbol{\theta})$ notation might also refer to the 1,1 element of the Fisher information matrix $I_n(\boldsymbol{\theta})$ for all $n$ data points. Unfortunately, there is no standard notation.

Just remember that when it comes time to construct a Wald confidence interval for a scalar $\theta$ it should have the form

$$\hat{\theta}\pm z_{1-\alpha/2}\cdot\hat{\text{se}} $$

where $\hat{\text{se}}$ may be a function of $\hat{\theta}$ and most definitely has $\sqrt{n}$ in its denominator. For example, in a $Y_1,...,Y_n\sim \text{Poisson}(\theta)$ model:

\begin{eqnarray}
f(y)&=&\frac{\theta^{y} e^{-\theta}}{y!}\\
L(\theta)&\propto&\prod_{i} \theta^{y_i} e^{-\theta} \\
\ell(\theta)&=&\sum_{i} \log\big(\theta^{y_i} e^{-\theta}\big)=\sum_{i} \{y_i\log(\theta)-\theta\}\\
\frac{d\ell(\theta)}{d\theta}&=& \sum_{i} \{y_i/\theta-1\} \\
\frac{d^2\ell(\theta)}{d\theta^2}&=& \sum_{i} \{-y_i/\theta^2\} \\
I_n(\theta)&=&nI_{11}(\theta)=-E\Big[\frac{d^2\ell(\theta)}{d\theta^2}\Big]=n/\theta \\
\text{Var}(\hat{\theta})&=&I^{-1}_n(\theta)=\theta/n\\
\hat{\text{Var}}(\hat{\theta})&=&I^{-1}_n(\hat{\theta})=\hat{\theta}/n
\end{eqnarray}

The Wald CI for the mean $\theta$ using the model-based standard error estimator would be

$$\hat{\theta}\pm z_{1-\alpha/2}\cdot\sqrt{\hat{\theta}/n} $$
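In code this interval is only a few lines (a sketch, not taken from any particular package). Note that for the Poisson model the observed information gives the same estimate, since $J_n(\hat{\theta})=\sum y_i/\hat{\theta}^2=n/\bar{y}=I_n(\hat{\theta})$:

```python
import numpy as np
from scipy import stats

def poisson_wald_ci(y, alpha=0.05):
    """Wald CI for the Poisson mean using the model-based standard error."""
    y = np.asarray(y)
    n = y.size
    theta_hat = y.mean()                      # MLE of theta
    se = np.sqrt(theta_hat / n)               # sqrt(I_n^{-1}(theta_hat))
    z = stats.norm.ppf(1 - alpha / 2)
    return theta_hat - z * se, theta_hat + z * se

# Example usage with simulated counts
y = np.random.default_rng(1).poisson(4.0, size=200)
print(poisson_wald_ci(y))
print(1.0 / (np.sum(y) / y.mean()**2))        # observed-information variance; equals theta_hat/n
```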

where $\hat{\theta}=\bar{y}$. For a $Y_1,...,Y_n\sim N(\theta,\sigma^2)$ model the Wald CI for the mean $\theta$ using the model-based standard error estimator would be

$$\hat{\theta}\pm z_{1-\alpha/2}\cdot\sqrt{\hat{\sigma}^{2}/n} $$

where $\hat{\theta}=\bar{y}$ and $\hat{\sigma}^{2}=n^{-1}\sum (y_i-\bar{y})^2$.
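The normal-mean version in code would be something like (again only a sketch):

```python
import numpy as np
from scipy import stats

def normal_mean_wald_ci(y, alpha=0.05):
    """Wald CI for the normal mean using the MLE plug-in variance."""
    y = np.asarray(y)
    n = y.size
    theta_hat = y.mean()
    sigma2_hat = np.mean((y - theta_hat) ** 2)   # MLE of sigma^2 (divides by n, not n-1)
    z = stats.norm.ppf(1 - alpha / 2)
    se = np.sqrt(sigma2_hat / n)
    return theta_hat - z * se, theta_hat + z * se
```

Let me know if I have made any mistakes.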

Geoffrey Johnson
  • Yeah, I'm still confused (I don't have much of a background in statistics though, so definitely not a problem with the explanation :) ). I guess I've got three little questions in response that might make it clearer: 1) What is the difference between $I_{1}$ and $I_{n}$? 2) Is the $I_{n}$ term still the Hessian divided by $n$? 3) Would you be multiplying $I_{n}^{-1}$ by $\sqrt{n}$ to get the covariance matrix in practice? – auf-wiedersehen-yall Dec 14 '21 at 04:19
  • Good questions. 1) $I_1(\theta_0)$ is the Fisher information based on a single datum, i.e. a sample of size $n=1$, while $I_n(\theta_0)$ is the Fisher information based on the full sample using all $n>1$ data points. 2) Based on the first question $I_n(\theta_0)$ is $I_1(\theta_0)$ *multiplied* by $n$ since it is the inversion of the variance of the MLE. 3) Based on 1) and 2) there is no need to multiply $I^{-1}_n(\theta_0)$ by any other terms. For a vector $\hat{\theta}$, $I^{-1}_n(\theta_0)$ is the covariance matrix. – Geoffrey Johnson Dec 14 '21 at 13:04
  • To obtain the Fisher information: 1) construct the Hessian matrix, 2) take the expectation of this matrix, 3) multiply this by $-1$. Ultimately there should be no terms involving data. It should involve only parameters and $n$. When it comes time to estimate the variance and covariance terms you can replace each parameter with its mle. – Geoffrey Johnson Dec 14 '21 at 13:16
  • In settings where $E[\hat{\theta}]\ne \theta$, using $I_n(\hat{\theta})$ could lead to a slightly different covariance matrix estimator compared to $J_n$ in your question, but both are consistent estimators of the covariance matrix. – Geoffrey Johnson Dec 14 '21 at 13:21
  • I'm learning a lot from these responses and think I'm getting close to having a full understanding, but I'm still a little uncertain. Went ahead and edited my question with a less abstract question. Thanks for all the help! – auf-wiedersehen-yall Dec 14 '21 at 15:34