I'm grappling with this question myself right now. Here's a result that may be helpful. Consider the linear model
$$y = X\beta + \epsilon, \quad \epsilon \sim N(0,\sigma^2 I_n)$$
where $y \in \mathbb{R}^n$, $X \in \mathbb{R}^{n \times p}$, $\beta \in \mathbb{R}^p$, and $\beta$ and $\sigma^2$ are the parameters of interest. The joint likelihood is
$$L(\beta,\sigma^2) = (2 \pi \sigma^2)^{-n/2} \exp\left(-\frac{||y-X\beta||^2}{2\sigma^2}\right)$$
Optimizing the joint likelihood yields
$$\hat{\beta} = X^+ y$$
$$\hat{\sigma}^2 = \frac{1}{n}||r||^2$$
where $X^+$ is the Moore-Penrose pseudoinverse of $X$ and $r=y-X\hat{\beta}$ is the residual vector. Note that $\hat{\sigma}^2$ uses the factor $1/n$ instead of the familiar degrees-of-freedom corrected factor $1/(n-p)$; this estimator is biased downward in finite samples.
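For concreteness, here is a minimal numerical sketch of the joint ML estimates (the dimensions, design, and variable names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (sizes and values are arbitrary choices for this sketch).
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
sigma_true = 2.0
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# Joint maximum-likelihood estimates
beta_hat = np.linalg.pinv(X) @ y      # beta_hat = X^+ y
r = y - X @ beta_hat                  # residual vector
sigma2_hat_joint = (r @ r) / n        # joint MLE: divides by n, not n - p

print(beta_hat, sigma2_hat_joint)
```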
Now suppose instead of optimizing over both $\beta$ and $\sigma^2$, we integrate $\beta$ out and estimate $\sigma^2$ from the resulting integrated likelihood:
$$\hat{\sigma}^2 = \operatorname{arg\,max}_{\sigma^2} \int_{\mathbb{R}^p} L(\beta,\sigma^2)\, d\beta$$
Using elementary linear algebra and the Gaussian integral formula (and assuming $X$ has full column rank $p$), you can show that
$$\hat{\sigma}^2 = \frac{1}{n-p} ||r||^2$$
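To sketch the key step: decompose the quadratic around $\hat{\beta}$ and integrate out $\beta$ with the Gaussian integral,
$$||y-X\beta||^2 = ||r||^2 + (\beta-\hat{\beta})^\top X^\top X (\beta-\hat{\beta}),$$
$$\int_{\mathbb{R}^p} L(\beta,\sigma^2)\, d\beta = (2\pi\sigma^2)^{-(n-p)/2}\, |X^\top X|^{-1/2} \exp\left(-\frac{||r||^2}{2\sigma^2}\right),$$
and setting the $\sigma^2$-derivative of the log of this expression to zero gives $\hat{\sigma}^2 = ||r||^2/(n-p)$.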
This estimate carries the degrees-of-freedom correction, which makes it unbiased, and it is generally favored over the joint ML estimate.
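If it helps, here is a quick Monte Carlo sketch of the bias difference between the two estimators (the sample sizes and random design below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Small n so the bias of the 1/n estimator is clearly visible (made-up values).
n, p, sigma_true = 20, 5, 2.0
X = rng.normal(size=(n, p))                # fixed design across replications
beta_true = rng.normal(size=p)
P = np.eye(n) - X @ np.linalg.pinv(X)      # projector onto the residual space

joint, integrated = [], []
for _ in range(20000):
    y = X @ beta_true + rng.normal(scale=sigma_true, size=n)
    rss = y @ P @ y                        # ||r||^2
    joint.append(rss / n)                  # joint ML estimate of sigma^2
    integrated.append(rss / (n - p))       # integrated ML estimate of sigma^2

# E[rss/n] = sigma^2 (n - p)/n = 3.0 here, while E[rss/(n - p)] = sigma^2 = 4.0
print(np.mean(joint), np.mean(integrated))
```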
From this result one might ask whether there is something inherently advantageous about the integrated likelihood, but I am not aware of any general results that answer that question. The consensus seems to be that integrated ML is better at accounting for uncertainty in most estimation problems. In particular, if you are estimating a quantity that depends on other parameter estimates (even implicitly), then integrating over the other parameters will better account for their uncertainties.