21

Consider a vector of parameters $(\theta_1, \theta_2)$, with $\theta_1$ the parameter of interest, and $\theta_2$ a nuisance parameter.

If $L(\theta_1, \theta_2 ; x)$ is the likelihood constructed from the data $x$, the profile likelihood for $\theta_1$ is defined as $L_P(\theta_1 ; x) = L(\theta_1, \hat{\theta}_2(\theta_1) ; x)$, where $\hat{\theta}_2(\theta_1)$ is the MLE of $\theta_2$ for a fixed value of $\theta_1$.
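As a concrete illustration (my own sketch, not part of the original question): for a normal sample with mean $\theta_1 = \mu$ as the parameter of interest and variance $\theta_2 = \sigma^2$ as the nuisance, the inner maximisation has a closed form, $\hat{\sigma}^2(\mu) = \frac{1}{n}\sum_i (x_i - \mu)^2$, so the profile log-likelihood can be written down directly:

```python
import numpy as np

# Hypothetical example (not from the question): normal sample, theta1 = mu
# (interest), theta2 = sigma^2 (nuisance). For fixed mu the inner MLE is
# sigma2_hat(mu) = mean((x - mu)^2), giving a closed-form profile log-likelihood.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)

def profile_loglik(mu):
    sigma2_hat = np.mean((x - mu) ** 2)  # MLE of sigma^2 for this fixed mu
    n = len(x)
    # Plugging sigma2_hat back in: l_P(mu) = -(n/2) * (log(2*pi*sigma2_hat) + 1)
    return -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

# Maximising l_P over mu recovers the joint MLE, the sample mean.
grid = np.linspace(x.min(), x.max(), 4001)
mu_hat = grid[np.argmax([profile_loglik(m) for m in grid])]
```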

$\bullet$ Maximising the profile likelihood with respect to $\theta_1$ leads to the same estimate $\hat{\theta}_1$ as the one obtained by maximising the likelihood simultaneously with respect to $\theta_1$ and $\theta_2$.

$\bullet$ I think the standard deviation of $\hat{\theta}_1$ may also be estimated from the second derivative of the profile likelihood.

$\bullet$ The likelihood statistic for $H_0: \theta_1 = \theta_0$ can be written in terms of the profile likelihood: $LR = 2 \log( \tfrac{L_P(\hat{\theta}_1 ; x)}{L_P(\theta_0 ; x)})$.

So it seems that the profile likelihood can be used exactly as if it were a genuine likelihood. Is that really the case? What are the main drawbacks of that approach? And what about the 'rumor' that the estimator obtained from the profile likelihood is biased (edit: even asymptotically)?
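To make the last bullet concrete, here is a self-contained sketch (an assumed normal-sample example, with illustrative names) of the likelihood-ratio statistic computed from the profile log-likelihood:

```python
import numpy as np

# Hypothetical sketch: the LR statistic from the bullet above, for a normal
# sample with H0: mu = 0 and sigma^2 profiled out. All names are illustrative.
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=200)  # data generated under H0

def profile_loglik(mu):
    sigma2_hat = np.mean((x - mu) ** 2)  # inner MLE of sigma^2 for fixed mu
    return -0.5 * len(x) * (np.log(2 * np.pi * sigma2_hat) + 1)

mu_hat = x.mean()  # maximiser of L_P (and of the joint likelihood)
LR = 2 * (profile_loglik(mu_hat) - profile_loglik(0.0))
# Asymptotically LR ~ chi-square(1) under H0; reject at the 5% level if LR > 3.84.
```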

ocram
    just a note, the estimators from the likelihood can also be biased, the classical example is likelihood variance estimate for normal sample. – mpiktas Apr 22 '11 at 07:19
  • @mpiktas: Thanks for your comment. Indeed, the classical mle can also be biased. I will edit the question to make things clearer. – ocram Apr 22 '11 at 07:23
  • what is the asymptotic bias? Are you talking about non-consistent estimators? – mpiktas Apr 22 '11 at 10:02
  • @mpiktas: Yes, this is what I should have said... – ocram Apr 22 '11 at 12:40

2 Answers

15

The estimate of $\theta_1$ from the profile likelihood is just the MLE. Maximizing with respect to $\theta_2$ for each possible $\theta_1$ and then maximizing with respect to $\theta_1$ is the same as maximizing with respect to $(\theta_1, \theta_2)$ jointly.
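A quick numerical check of this point (an illustrative normal example of my own, not from the original answer): on a grid, maximising over $\sigma^2$ for each $\mu$ and then over $\mu$ picks out the same $\mu$ as a joint grid search over $(\mu, \sigma^2)$:

```python
import numpy as np

# Illustrative check: two-stage (profile) maximisation vs joint maximisation
# over a grid, for a normal log-likelihood in (mu, sigma^2).
rng = np.random.default_rng(3)
x = rng.normal(loc=0.5, scale=1.0, size=50)

def loglik(mu, sigma2):
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

mus = np.linspace(-1.0, 2.0, 301)
sig2s = np.linspace(0.2, 3.0, 281)
L = np.array([[loglik(m, s) for s in sig2s] for m in mus])

i_joint = np.unravel_index(np.argmax(L), L.shape)[0]  # joint maximisation
i_profile = np.argmax(L.max(axis=1))                  # profile out sigma^2, then maximise
```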

The key weakness is that, if you base your estimate of the SE of $\hat{\theta}_1$ on the curvature of the profile likelihood, you are not fully accounting for the uncertainty in $\theta_2$.
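For what a curvature-based SE looks like in practice, here is an illustrative sketch (assumed normal example, not from the original answer) using a finite-difference second derivative of the profile log-likelihood:

```python
import numpy as np

# Illustrative sketch: SE of mu_hat from the curvature of the profile
# log-likelihood, via a central finite-difference second derivative. For the
# normal mean this happens to reproduce sigma_hat / sqrt(n); the caveat above
# is that with a large nuisance dimension the profile curvature can mislead.
rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=500)

def profile_loglik(mu):
    sigma2_hat = np.mean((x - mu) ** 2)
    return -0.5 * len(x) * (np.log(2 * np.pi * sigma2_hat) + 1)

mu_hat, h = x.mean(), 1e-3
curvature = (profile_loglik(mu_hat + h) - 2.0 * profile_loglik(mu_hat)
             + profile_loglik(mu_hat - h)) / h ** 2
se = np.sqrt(-1.0 / curvature)  # observed-information-style standard error
```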

McCullagh and Nelder, Generalized Linear Models, 2nd edition, has a short section on profile likelihood (Sec. 7.2.4, pp. 254-255). They say:

[A]pproximate confidence sets may be obtained in the usual way....such confidence intervals are often satisfactory if [the dimension of $\theta_2$] is small in relation to the total Fisher information, but are liable to be misleading otherwise.... Unfortunately [the profile log likelihood] is not a log likelihood function in the usual sense. Most obviously, its derivative does not have zero mean, a property that is essential for estimating equations.

Karl
  • Thank you very much for your answer. Before accepting it, let me please ask something more. What are the implications of $E \frac{\partial l_P(\theta_1)}{\partial \theta_1} \neq 0$? – ocram Aug 22 '11 at 19:23
  • Interesting question, though it required a trip to the bookshelf (which I should have done anyway). I've added a bit to my answer on this point. – Karl Aug 22 '11 at 19:58
  • Thank you very much for the edit. It is said that the property (the score evaluated at the true parameter value has mean zero) is essential for estimating equations. But though the profile log likelihood does not fulfill that property it does produce the MLE. Is there something I miss? – ocram Aug 23 '11 at 06:50
  • That property is not necessary for providing the MLE. – Karl Aug 23 '11 at 11:39
  • @ocram We set the first derivative equal to 0 to find the MLE. So how is the property not used in providing the MLE? Where specifically do we use the property? – time Dec 31 '21 at 07:29
0

The major drawback is that the profile likelihood is completely meaningless.

The profile likelihood should be viewed as an intermediate quantity that facilitates the application of asymptotic approximations (Wilks' theorem, etc.) for the purpose of constructing confidence intervals and regions.

By itself, however, it doesn't have any coherent meaning or interpretation that I'm aware of.

innisfree