A general answer is tough (and +1 to @Shifer), but if you're looking for one particular example where integrating outperforms profiling, you might find the Neyman-Scott "paradox" interesting.
The problem: suppose we have $\{(x_1,y_1),\dots,(x_n,y_n)\}$ where each pair $(x_i, y_i)$ is independent and
$$
{x_i \choose y_i} \sim \mathcal N(\mu_i \mathbf 1, \sigma ^2 I_2).
$$
Thus we have $n$ pairs of Gaussian RVs where all $2n$ RVs are independent but each pair has a different mean. The goal now is to estimate $\sigma^2$.
I'm going to do this with matrices to avoid summations all over the place, so I'll write this as$\newcommand{\e}{\varepsilon}$
$$
z = A\mu + \e
$$
where $z = (x_1, y_1, \dots, x_n, y_n)^T$, $\e \sim \mathcal N(0, \sigma^2 I_{2n})$, $\mu = (\mu_1,\dots,\mu_n)^T$, and
$$
A = \left(\begin{array}{cccccc}
1 & 0 & 0 & \dots & 0 & 0\\
1 & 0 & 0 & \dots & 0& 0\\
0 & 1 & 0 & \dots & 0& 0\\
0 & 1 & 0 & \dots & 0& 0\\
& & & \vdots & & \\
0 & 0 & 0 & \dots & 0& 1 \\
0 & 0 & 0 & \dots & 0& 1
\end{array}\right)
$$
so $A$ picks out the correct mean parameter for each pair and then the error is a spherical Gaussian.
I'll use $\tau =1 / \sigma^2$ in some places to make the math (especially the derivatives) easier.
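If you want to play with this numerically, here's a minimal numpy sketch of the setup (the particular `n` and `sigma2` are just illustrative values of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 50, 2.0

# A stacks each of the n means twice: A = I_n kron (1, 1)^T, shape (2n, n)
A = np.kron(np.eye(n), np.ones((2, 1)))

mu = rng.normal(size=n)  # one mean per pair; any fixed values work here
z = A @ mu + rng.normal(scale=np.sqrt(sigma2), size=2 * n)  # z = A mu + eps
```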
Profiling
The likelihood is
$$
f(z | \mu, \sigma^2) = \left(\frac{\tau}{2\pi} \right)^n \exp\left(-\frac \tau 2 \|z- A\mu\|^2\right).
$$
We'll first find the MLE of $\mu$ and plug that in to get a profiled likelihood. This is just OLS linear regression so $\hat\mu = (A^TA)^{-1}A^Tz$ in this case, and therefore the profiled log likelihood is (up to some constant)
$$
\ell_p(z | \hat\mu, \tau) = -\frac \tau 2 \|z- A\hat \mu\|^2 + n\log \tau.
$$
We don't actually need this, but it's worth noting that $A^TA = 2I$ so actually $\hat\mu$ is just the mean of each pair, i.e. $\hat\mu_i = \frac{x_i + y_i}{2}$.
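Continuing the sketch above, both facts are easy to confirm numerically:

```python
# A^T A = 2 I_n, so OLS collapses to pairwise means
assert np.allclose(A.T @ A, 2 * np.eye(n))

mu_hat = np.linalg.solve(A.T @ A, A.T @ z)  # (A^T A)^{-1} A^T z
pair_means = z.reshape(n, 2).mean(axis=1)   # (x_i + y_i) / 2
assert np.allclose(mu_hat, pair_means)
```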
This leads to
$$
\frac{\partial \ell_p}{\partial \tau} = -\frac 1 2 \|z- A\hat \mu\|^2 + n\tau^{-1}.
$$
Setting this to zero, solving for $\tau$, and inverting to get the MLE of $\sigma^2$, we get
$$
\hat\sigma^2 = \frac{\|z- A\hat \mu\|^2}{2n}
$$
(you can take the second derivative to show this is actually a max, and that is one of the main simplifications in using $\tau$ instead of $\sigma^2$).
Let $H = A(A^TA)^{-1}A^T$ and note that
$$
\|z- A\hat \mu\|^2 = \|(I-H)z\|^2 = z^T(I-H)z
$$
so since $z \sim \mathcal N(A\mu, \sigma^2 I)$ we've got a Gaussian quadratic form. This means
$$
E(z^T(I-H)z) = \sigma^2 \text{tr}(I-H) + \mu^TA^T(I-H)A\mu = n\sigma^2
$$
since $(I-H)A = 0$ and $\text{tr}(I-H) = 2n - \text{rank}(A) = n$ (see e.g. here for a proof of this result for quadratic forms).
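Those two facts, $(I-H)A = 0$ and $\text{tr}(I-H) = n$, can also be checked in the sketch from earlier:

```python
H = A @ np.linalg.solve(A.T @ A, A.T)  # hat matrix A (A^T A)^{-1} A^T
I2n = np.eye(2 * n)
assert np.isclose(np.trace(I2n - H), n)  # tr(I - H) = 2n - n = n
assert np.allclose((I2n - H) @ A, 0)     # H projects onto col(A)
```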
All together this means
$$
E(\hat\sigma^2) = \frac{n\sigma^2}{2n} = \frac{\sigma^2}2
$$
so $\hat\sigma^2$ is biased, and this bias does not go away as $n\to\infty$ (i.e. it is inconsistent). That's not good.
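A quick Monte Carlo (my own illustrative check, not part of the derivation) makes the factor-of-two bias visible:

```python
def sigma2_profiled(z, A):
    """Profiled MLE of sigma^2: ||z - A mu_hat||^2 / (2n)."""
    mu_hat = np.linalg.solve(A.T @ A, A.T @ z)
    return np.sum((z - A @ mu_hat) ** 2) / len(z)  # len(z) = 2n

ests = []
for _ in range(2000):
    mu = rng.normal(size=n)
    z = A @ mu + rng.normal(scale=np.sqrt(sigma2), size=2 * n)
    ests.append(sigma2_profiled(z, A))
print(np.mean(ests))  # hovers near sigma2 / 2 = 1.0, not sigma2 = 2.0
```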
Integrating
So we can try something else. I'm going to suppose $\mu \sim \mathcal N(0, (\tau\lambda)^{-1} I)$ (and $\mu \perp \e$) and then I'll integrate $\mu$ out and maximize the resulting integrated likelihood (so it'll be a marginal, or "type II," maximum likelihood estimate).
I don't actually need to evaluate the integral in this case since $\mu$ and $\e$ being independent Gaussians means the marginal distribution is also Gaussian. In particular,
$$
A\mu + \e \sim \mathcal N(0, (\tau\lambda)^{-1}(AA^T + \lambda I))
$$
so now I have
$$
f_I(z | \tau, \lambda) = \left(\frac{\tau\lambda}{2\pi}\right)^{n} |AA^T + \lambda I|^{-1/2} \exp\left(-\frac{\tau\lambda}2 z^T(AA^T+\lambda I)^{-1}z\right).
$$
I'm going to obtain my estimate by maximizing this w.r.t. $\tau$ so I'll take logs to get
$$
\ell_I(z | \tau, \lambda) = n\log \tau - \frac{\tau\lambda}2 z^T(AA^T+\lambda I)^{-1}z
$$
up to some constants. This leads to
$$
\frac{\partial \ell_I}{\partial \tau} = \frac{n}{\tau} - \frac{\lambda}{2} z^T(AA^T+\lambda I)^{-1}z
$$
so setting this to zero and inverting gives
$$
\tilde \sigma^2 = \frac{\lambda}{2n}z^T(AA^T+\lambda I)^{-1}z.
$$
This again is a Gaussian quadratic form, although now, marginally over $\mu$, $z \sim \mathcal N(0, (\sigma^2/\lambda)(AA^T+\lambda I))$, which means
$$
E(z^T(AA^T+\lambda I)^{-1} z) = \frac{\sigma^2}\lambda \text{tr}\left[(AA^T+\lambda I)^{-1}(AA^T+\lambda I)\right] \\
= \frac{2n \sigma^2}\lambda
$$
so
$$
E(\tilde \sigma^2) = \frac{\lambda}{2n} \cdot \frac{2n \sigma^2}\lambda = \sigma^2
$$
so not only is this unbiased, it is unbiased for any valid prior variance (i.e. any $\lambda > 0$).
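The same kind of Monte Carlo check (again a sketch of mine; the $\lambda$ values are arbitrary, and $\mu$ is drawn from its matching prior $\mathcal N(0, (\sigma^2/\lambda) I)$, which is the distribution the unbiasedness calculation averages over) lands near $\sigma^2$ for each $\lambda$:

```python
def sigma2_integrated(z, A, lam):
    """Integrated-likelihood estimate: (lam / 2n) z^T (A A^T + lam I)^{-1} z."""
    m = A.shape[0]  # m = 2n
    return lam * z @ np.linalg.solve(A @ A.T + lam * np.eye(m), z) / m

for lam in (0.1, 1.0, 10.0):
    ests = []
    for _ in range(2000):
        mu = rng.normal(scale=np.sqrt(sigma2 / lam), size=n)  # matching prior
        z = A @ mu + rng.normal(scale=np.sqrt(sigma2), size=2 * n)
        ests.append(sigma2_integrated(z, A, lam))
    print(lam, np.mean(ests))  # each hovers near sigma2 = 2.0
```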
This example can definitely feel a little contrived, but it does align with at least the intuitive idea that when there are tons of parameters, integrating with respect to a sensible prior (and I think a Gaussian prior for normal means is often sensible) can lead to better results than profiling (I call this intuitive because I think of averages as being more stable than maxima). But I was fortunate here that everything was analytically tractable for the integration; in general you won't be so lucky. In summary, this is a big topic with lots of complexities, but hopefully this was at least interesting.