26

Why is it so common to obtain maximum likelihood estimates of parameters, but you virtually never hear about expected likelihood parameter estimates (i.e., based on the expected value rather than the mode of a likelihood function)? Is this primarily for historical reasons, or for more substantive technical or theoretical reasons?

Would there be significant advantages and/or disadvantages to using expected likelihood estimates rather than maximum likelihood estimates?

Are there some areas in which expected likelihood estimates are routinely used?

Jake Westfall
  • Expected value with respect to what probability distribution? ML is usually applied in non-Bayesian analyses where (a) the data are given (and fixed) and (b) the parameters are treated as (unknown) constants: there are no random variables at all. – whuber Apr 01 '14 at 15:43
  • @whuber: See my answer: expected with respect to the assumed model? – kjetil b halvorsen Feb 24 '20 at 14:39

4 Answers

17

The method proposed (after normalizing the likelihood to be a density) is equivalent to estimating the parameters using a flat prior for all the parameters in the model and using the mean of the posterior distribution as your estimator. There are cases where using a flat prior can get you into trouble because you don't end up with a proper posterior distribution, and I don't know how you would rectify that situation here.
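To make that equivalence concrete, here is a minimal numerical sketch (my own illustration, not from the thread) using a made-up binomial example: normalizing the likelihood over a parameter grid and taking its mean reproduces the flat-prior posterior mean $(k+1)/(n+2)$ rather than the MLE $k/n$.

```python
# Minimal sketch: the "expected likelihood" estimate equals the posterior mean
# under a flat prior. Toy binomial data: k successes in n trials (made up).
import numpy as np

k, n = 3, 10
theta = np.linspace(0.0, 1.0, 10001)   # grid over the parameter space
dt = theta[1] - theta[0]

lik = theta**k * (1.0 - theta)**(n - k)     # likelihood, up to a constant
density = lik / (lik.sum() * dt)            # normalize it to a density in theta
el_est = (theta * density).sum() * dt       # its mean: the proposed estimate

print("MLE k/n:                     ", k / n)              # 0.300
print("expected-likelihood estimate:", round(el_est, 3))   # ~0.333
print("flat-prior posterior mean:   ", (k + 1) / (n + 2))  # 0.333...
```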

Staying in a frequentist context, though, the method doesn't make much sense: the likelihood doesn't constitute a probability density in most contexts, and there is nothing random left, so taking an expectation isn't meaningful. We could formalize this as an operation we apply to the likelihood after the fact to obtain an estimate, but I'm not sure what the frequentist properties of that estimator would look like (in the cases where the estimate actually exists).

Advantages:

  • This can provide an estimate in some cases where the MLE doesn't actually exist.
  • If you're not stubborn it can move you into a Bayesian setting (and that would probably be the natural way to do inference with this type of estimate). OK, so depending on your views this may not be an advantage, but it is to me.

Disadvantages:

  • This isn't guaranteed to exist either.
  • If we don't have a convex parameter space the estimate may not be a valid value for the parameter.
  • The process isn't invariant to reparameterization. Since the process is equivalent to putting a flat prior on your parameters, it makes a difference what those parameters are (are we talking about using $\sigma$ as the parameter, or are we using $\sigma^2$?). A numerical illustration of this is sketched below.
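Here is a small numerical sketch of that last point (my own illustration, with made-up normal data and truncated grids, so the integrals are only approximate): normalizing the likelihood over $\sigma$ versus over $\sigma^2$ and taking the mean in each parameterization gives estimates that do not map onto each other.

```python
# Sketch of the reparameterization issue: the normalized-likelihood mean computed
# in terms of sigma, then squared, disagrees with the one computed in terms of
# sigma^2. (Toy normal data with known mean 0; grids are truncated approximations.)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=20)   # made-up data, true sigma = 2
n, s2 = len(x), np.sum(x**2)        # s2 is the sufficient statistic

def normalized_lik_mean(grid, loglik):
    w = np.exp(loglik - loglik.max())   # likelihood on the grid
    dx = grid[1] - grid[0]
    w /= w.sum() * dx                   # normalize to a density
    return (grid * w).sum() * dx        # mean of that density

sigma = np.linspace(0.5, 10.0, 20001)
sigma2 = np.linspace(0.25, 100.0, 20001)

est_sigma = normalized_lik_mean(sigma, -n * np.log(sigma) - s2 / (2 * sigma**2))
est_sigma2 = normalized_lik_mean(sigma2, -(n / 2) * np.log(sigma2) - s2 / (2 * sigma2))

print(est_sigma**2, est_sigma2)  # equal only if the procedure were invariant
```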
Dason
  • +1 One *huge* problem with assuming a uniform distribution of the parameters is that ML problems are often reformulated by exploiting the invariance of their solutions to reparameterization: however, that would change the prior distribution on the parameters. Thus taking an "expectation" as if the parameters have a uniform distribution is an arbitrary artifact and can lead to mistaken and meaningless results. – whuber Apr 01 '14 at 16:46
  • Good point! I was going to mention that as well but forgot to bring it up while typing up the rest. – Dason Apr 01 '14 at 16:47
  • For the record, maximum likelihood isn't invariant to reparametrization either. – Neil G Apr 03 '14 at 06:40
  • @NeilG Yes it is? Maybe we're referring to different ideas though. What do you mean when you say that? – Dason Apr 03 '14 at 11:43
  • Perhaps I've made a mistake, but suppose you have a parameter that represents a probability $p \in [0,1]$. The data induces a Beta-distributed likelihood on it with parameters $\alpha=\beta=2$. If instead you had parametrized your model using odds $o \in [0, \infty)$, the same data would induce a Beta-prime likelihood with parameters $\alpha=\beta=2$. In the first case, the mode is $\frac12$; in the second case, the mode is $\frac13$, which corresponds to a probability of $\frac14$. – Neil G Apr 03 '14 at 22:10
  • I'm not really understanding your example - what exactly are the parameters you're estimating? Anyhow this is what is typically meant when people say MLEs are invariant: http://en.wikipedia.org/wiki/Maximum_likelihood#Functional_invariance – Dason Apr 04 '14 at 02:54
  • You're estimating $p$ (or $o$). The data is Bernoulli $x_i \in \{0,1\}$ consisting of two data points $0$ and $1$. We're starting with a uniform prior on $p$ (Beta(1,1)), which is a Beta-prime(1,1) prior on $o$. If you have some induced likelihood over the parameter $\ell(\theta)$, it looks like transforming $\theta$ should be able to change the maximum since you can stretch out places with high likelihood and contract other places, which can change the mode of the likelihood. – Neil G Apr 04 '14 at 04:17
  • @NeilG Have you tried working that problem out? If you do the math you can show that if you parameterize using the odds and find the maximum your MLE for the odds is phat/(1-phat) where phat=x/n. So yes you reparameterized the problem using a function of p instead of p as your parameter and it came out that the MLE for the function of p was that function applied to the MLE of p. Your example shows exactly what I've been saying (a small numerical check of this follows these comments). – Dason Apr 04 '14 at 14:39
  • You're right; thanks for being patient. I was confusing MAP with MLE. – Neil G Apr 04 '14 at 21:29
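A small numerical check of the point the thread converges on (my own sketch, using the two Bernoulli observations from the comment): maximizing the likelihood written in terms of $p$ and maximizing it written in terms of the odds $o = p/(1-p)$ give estimates that map onto each other, whereas transforming the *normalized* likelihood as a density (with a Jacobian) would shift the mode, which is the Beta vs. Beta-prime discrepancy above.

```python
# Functional invariance of the MLE: the likelihood of the odds o = p/(1-p) is
# just the likelihood of p re-expressed, so its maximizer is phat/(1-phat).
import numpy as np

k, n = 1, 2  # the two Bernoulli data points 0 and 1 from the comment

p = np.linspace(1e-6, 1 - 1e-6, 100001)
o = np.linspace(1e-6, 50.0, 500001)

loglik_p = k * np.log(p) + (n - k) * np.log(1 - p)
loglik_o = k * np.log(o / (1 + o)) + (n - k) * np.log(1 / (1 + o))

p_hat = p[np.argmax(loglik_p)]
o_hat = o[np.argmax(loglik_o)]

print(p_hat, o_hat, p_hat / (1 - p_hat))   # o_hat matches p_hat / (1 - p_hat)
```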
13

One reason is that maximum likelihood estimation is easier: you set the derivative of the likelihood w.r.t. the parameters to zero and solve for the parameters. Taking an expectation means integrating the likelihood times each parameter.

Another reason is that with exponential families, maximum likelihood estimation corresponds to taking an expectation of the sufficient statistics. For example, the maximum likelihood normal distribution fitting data points $\{x_i\}$ has mean $\mu=E(x)$ and second moment $\chi=E(x^2)$, where the expectations are taken over the empirical distribution of the data.
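As a tiny illustration (my own sketch, with made-up numbers): the ML normal fit just matches the empirical expectations of the sufficient statistics $x$ and $x^2$.

```python
# ML fit of a normal by matching empirical moments of the sufficient statistics.
import numpy as np

x = np.array([1.2, 0.4, 2.5, 1.1, 0.8])   # made-up data
mu = x.mean()           # E(x) under the empirical distribution
chi = (x**2).mean()     # E(x^2) under the empirical distribution
sigma2 = chi - mu**2    # the implied ML variance estimate

print(mu, chi, sigma2)
```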

In some cases, the maximum likelihood parameter is the same as the expected likelihood parameter. For example, the expected likelihood mean of the normal distribution above is the same as the maximum likelihood one, because the normalized likelihood of the mean is itself normal, and the mode and mean of a normal distribution coincide. Of course that won't be true for the other parameter (however you parametrize it).

I think the most important reason is probably this: why do you want an expectation of the parameters? Usually you are learning a model, and the parameter values are all you want. If you're going to return a single value, isn't the maximum likelihood the best you can return?

Neil G
  • With respect to your last line: Maybe - maybe not. It depends on your loss function. I just toyed with Jake's idea and it seems like for the case with X ~ Unif(0, theta) that max(X)*(n-1)/(n-2), which is what Jake's method gives, has a better MSE than max(X) which is the MLE (at least simulations imply this when n >= 5). Obviously the Unif(0, theta) example isn't typical but it does show that there are other plausible methods to obtain estimators (see the simulation sketch after these comments). – Dason Apr 01 '14 at 15:22
  • @Dason One standard (and powerful) *frequentist* technique for finding good (*i.e.*, admissible) estimators is to compute Bayes estimators for various priors. (See, *e.g.*, Lehmann's book on point estimation.) You have just rediscovered one such estimator. – whuber Apr 01 '14 at 16:50
  • Thanks for your answer Neil! You say that obtaining the parameter estimates via differentiation is easier compared to integration, and I can certainly see how this would be true for simple problems (e.g., pen-and-paper level or not too far beyond). But for much more complicated problems where we have to rely on numerical methods, might it not actually be easier to use integration? In practice finding the MLE can amount to quite a difficult optimization problem. Couldn't numerically approximating the integral actually be computationally easier? Or is that unlikely to be true in most cases? – Jake Westfall Apr 02 '14 at 19:43
  • @JakeWestfall: How are you going to take an expectation over the parameter space using numerical methods? In a complicated model space with a huge parameter space, you can't integrate over the whole thing evaluating the probability of each model (parameter setting). You are typically going to run EM for which the parameter estimation happens in the M-step so that each parameter is one of the "simple problems" as you say, and for which maximum likelihood parameters are straightforward expectations of sufficient statistics. – Neil G Apr 03 '14 at 05:06
  • @NeilG Well, Dason points out that the method I am discussing is (after normalization) equivalent to Bayesian estimation with a flat prior and then using the posterior mean as the estimate. So in response to "How are you going to take an expectation over the parameter space using numerical methods?" I guess I was thinking we could use one of these methods: http://www.bayesian-inference.com/numericalapproximation Any thoughts on this? – Jake Westfall Apr 03 '14 at 06:26
  • @JakeWestfall: I didn't completely understand Dason's answer so I don't want to comment on it. I also don't see what the inference algorithms listed on that page have to do with the learning algorithm you are proposing. Inference is choosing configurations given inputs and fixed model parameters. Learning is choosing parameters given training examples and a fixed loss functional. – Neil G Apr 03 '14 at 07:10
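A quick simulation sketch of the Unif(0, theta) comparison mentioned in the first comment above (my own check, not from the thread): the normalized-likelihood mean max(X)·(n-1)/(n-2) versus the MLE max(X).

```python
# MSE comparison for X ~ Unif(0, theta): MLE max(X) vs. max(X)*(n-1)/(n-2),
# the estimate obtained by normalizing the likelihood and taking its mean.
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 1.0, 200_000

for n in (5, 10, 20):
    m = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
    mse_mle = np.mean((m - theta) ** 2)
    mse_el = np.mean((m * (n - 1) / (n - 2) - theta) ** 2)
    print(n, mse_mle, mse_el)
```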
2

This approach exists and is called the minimum contrast estimator. For an example of a related paper (and further references inside it), see https://arxiv.org/abs/0901.0655

2

There is an interesting paper, Expected Maximum Log Likelihood Estimation, that proposes maximizing not the observed likelihood but the expected likelihood. In many examples this gives the same results as MLE, but in some examples where it is different, it is arguably better, or at least different in an interesting way.

Note that this is a purely frequentist idea, so it is different from what is discussed in the other answers, where the expectation is taken over the parameter itself, which is a (quasi-)Bayesian idea.

One example: take the usual multiple linear regression model, with normal errors. The log-likelihood function is then (up to a constant)
$$ \log L(\beta, \sigma^2) = -\frac{n}{2}\log \sigma^2 - \frac1{2\sigma^2} (Y-X\beta)^T (Y-X\beta), $$
and its derivative with respect to $\sigma^2$ can be written (with $\hat{\beta}=(X^TX)^{-1} X^T Y$, the usual least-squares estimator of $\beta$) as
$$ \left[ -\frac{n}{2\sigma^2}+\frac1{2\sigma^4}(Y-X\hat{\beta})^T(Y-X\hat{\beta})\right]+\frac1{2\sigma^4}(\hat{\beta}-\beta)^T X^T X(\hat{\beta}-\beta). $$
The second term here is $\frac12 \left(\frac{\partial \log L}{\partial \beta}\right)^T (X^T X)^{-1} \frac{\partial \log L}{\partial \beta}$, with expectation $\frac{p}{2\sigma^2}$, where $p$ is the number of columns in $X$. Replacing it by its expectation, the estimating equation for $\sigma^2$ becomes
$$ -\frac{n}{2\sigma^2}+\frac1{2\sigma^4}(Y-X\hat{\beta})^T (Y-X\hat{\beta}) + \frac{p}{2\sigma^2} = 0. $$
The solution is the usual bias-corrected estimator, with denominator $n-p$, and not $n$, as for the MLE.
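A quick numerical check of that estimating equation (my own sketch with simulated data): its root is exactly RSS/(n-p), the bias-corrected variance estimator, rather than the MLE RSS/n.

```python
# Check that the root of  -n/(2 s) + RSS/(2 s**2) + p/(2 s) = 0  (s = sigma^2)
# is RSS / (n - p), using a simulated regression.
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])   # made-up design
beta = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta + rng.normal(scale=1.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
rss = float(np.sum((y - X @ beta_hat) ** 2))

score = lambda s: -n / (2 * s) + rss / (2 * s**2) + p / (2 * s)

print("MLE RSS/n:            ", rss / n)
print("corrected RSS/(n-p):  ", rss / (n - p))
print("equation at RSS/(n-p):", score(rss / (n - p)))   # ~ 0 up to rounding
```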

kjetil b halvorsen