26

Maximum likelihood estimation often results in biased estimators (e.g., its estimate of the variance is biased for the Gaussian distribution).

What then makes it so popular? Why exactly is it used so much? Also, what in particular makes it better than the alternative approach -- the method of moments?

Also, I noticed that for the Gaussian, a simple scaling of the ML estimator makes it unbiased. Why is this scaling not a standard procedure? I mean -- why is it not routine, after computing the MLE, to find the scaling needed to make the estimator unbiased? The standard practice seems to be to report the plain MLE, except of course in the Gaussian case, where the scaling factor is well known.

Richard Hardy
  • 54,375
  • 10
  • 95
  • 219
Minaj
  • 1,201
  • 1
  • 12
  • 21
  • 11
    There are many, many alternatives to ML, not just the method of moments--which also tends to produce biased estimators, by the way. What you might want to ask instead is "why would anybody want to use an unbiased estimator?" A good way to start researching this issue is a search on [bias-variance tradeoff](http://stats.stackexchange.com/search?q=bias+variance+tradeoff). – whuber Nov 22 '15 at 16:28
  • 7
    As whuber pointed out, there is no intrinsic superiority in being unbiased. – Xi'an Nov 22 '15 at 16:38
  • 4
    I think @whuber means "why would anybody want to use a *biased* estimator?" It doesn't take much work to convince someone that an unbiased estimator may be a reasonable one. – Cliff AB Nov 22 '15 at 17:31
  • 5
    See https://en.wikipedia.org/wiki/Bias_of_an_estimator#Estimating_a_Poisson_probability for an example where the only unbiased estimator is certainly not one you'd want to use. – Scortchi - Reinstate Monica Nov 22 '15 at 19:24
  • 4
    @Cliff I intended to ask the question in its more provocative, potentially more mysterious form. Lurking behind this is the idea that there are many ways to evaluate the quality of an estimator and many of them have nothing to do with bias. From that point of view, it is most natural to ask why someone would propose an *unbiased* estimator. See glen_b's answer for more from this point of view. – whuber Nov 23 '15 at 16:44
  • 2
    @whuber: oh I see. You were suggesting he ask "why use an unbiased estimator instead of an MLE?" – Cliff AB Nov 23 '15 at 17:38
  • 2
    If you think that unbiasedness is some sort of "ideal" property there are for instance examples where the UMVUE of a positive parameter ends up being negative with positive probability. In these cases you've actually hurt your estimator by requiring that it be unbiased. – dsaxton Nov 23 '15 at 18:37

5 Answers

20

Unbiasedness isn't necessarily especially important on its own.

Aside from a very limited set of circumstances, most useful estimators are biased, however they're obtained.

If two estimators have the same variance, one can readily mount an argument for preferring an unbiased one to a biased one, but that's an unusual situation to be in (that is, you may reasonably prefer unbiasedness, ceteris paribus -- but those pesky ceteris are almost never paribus).

More typically, if you want unbiasedness you'll be adding some variance to get it, and then the question is: why would you do that?

Bias is how much too high my estimator will be on average (with negative bias indicating how much too low).

When I'm considering a small-sample estimator, I don't really care about that. I'm usually more interested in how far wrong my estimator will be in this instance -- my typical distance from the right answer -- so something like a root-mean-square error or a mean absolute error would make more sense.

So if you like low variance and low bias, asking for, say, a minimum mean square error (MSE) estimator would make sense; such estimators are very rarely unbiased.
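
For reference, the standard decomposition behind this tradeoff (not part of the original answer, but it is the identity being appealed to):

$$\operatorname{MSE}(\hat\theta) = E\bigl[(\hat\theta-\theta)^2\bigr] = \operatorname{Var}(\hat\theta) + \operatorname{Bias}(\hat\theta)^2,$$

so accepting a small bias is worthwhile whenever it buys a larger reduction in variance.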

Bias is a useful notion to be aware of, but unbiasedness isn't an especially useful property to seek unless you're comparing only estimators that have the same variance.

ML estimators tend to be low-variance; they're usually not minimum-MSE, but they often have lower MSE than modifying them to be unbiased (when you can do that at all) would give you.

As an example, consider estimating the variance when sampling from a normal distribution. Writing $S^2 = \sum_i (x_i - \bar{x})^2$ for the sum of squared deviations, $\hat{\sigma}^2_\text{MMSE} = \frac{S^2}{n+1}$, $\hat{\sigma}^2_\text{MLE} = \frac{S^2}{n}$, and $\hat{\sigma}^2_\text{Unb} = \frac{S^2}{n-1}$ (indeed, the minimum-MSE estimator of the variance always has a larger denominator than $n-1$).
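
A quick way to see this numerically is a small Monte Carlo comparison of the three denominators (a sketch added for illustration, not part of the original answer; the sample size and number of replications are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma2, reps = 10, 4.0, 100_000  # small sample, true variance 4

# reps independent normal samples of size n
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
S2 = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # sum of squared deviations

for label, denom in [("MMSE (n+1)", n + 1), ("MLE (n)", n), ("Unbiased (n-1)", n - 1)]:
    est = S2 / denom
    print(f"{label:>15}: bias = {est.mean() - sigma2:+.3f}, MSE = {((est - sigma2) ** 2).mean():.3f}")
```

The unbiased estimator comes out (essentially) unbiased but with the largest MSE of the three, while dividing by $n+1$ gives the smallest MSE.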

Glen_b
  • 257,508
  • 32
  • 553
  • 939
  • 1
    +1. Is there any intuition for (or perhaps some theory behind) your second-to-last paragraph? Why do ML estimators tend to be low-variance? Why do they often have lower MSE than the unbiased estimator? Also, I am amazed to see the expression for the MMSE estimator of variance; somehow I have never encountered it before. Why is it so rarely used? And does it have anything to do with shrinkage? It seems that it is "shrunk" from unbiased towards zero, but I am confused by that as I am used to thinking about shrinkage only in the multivariate context (along the lines of James-Stein). – amoeba Nov 25 '15 at 01:16
  • 1
    @amoeba MLEs are generally functions of sufficient statistics, and at least asymptotically minimum variance unbiased, so you expect them to be low variance in large samples, typically achieving the CRLB in the limit; this is often reflected in smaller samples. MMSE estimators *are* generally shrunk toward zero because that reduces variance (and hence a small amount of bias toward 0 introduced by a small shrinkage will typically reduce MSE). – Glen_b Nov 25 '15 at 04:16
  • @Glen_b, great answer (I keep coming back to it). Would you have an explanation or a reference for $\hat{\sigma}^2_\text{MMSE} = \frac{S^2}{n+1}$ being the minimum MSE estimator? – Richard Hardy Apr 26 '18 at 08:56
  • Also, does that imply the ML estimator of variance is not a minimum-variance estimator? Otherwise the minimum MSE estimator would be some weighted average (with positive weights) of the MLE and the unbiased estimator, but now it is outside that range. I could ask this as a separate question if you think it makes sense. – Richard Hardy Apr 26 '18 at 09:12
  • 1
    Found a whole derivation in a [Wikipedia article on MSE](https://en.wikipedia.org/wiki/Mean_squared_error#Variance); I guess that explains all of it. – Richard Hardy Apr 26 '18 at 13:02
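
For readers following along, here is a condensed version of the derivation linked in the comment above, restricted to estimators of the form $cS^2$ with $S^2=\sum_i(x_i-\bar x)^2$. Since $S^2/\sigma^2 \sim \chi^2_{n-1}$ for a normal sample, $E[S^2]=(n-1)\sigma^2$ and $\operatorname{Var}(S^2)=2(n-1)\sigma^4$, so

$$\operatorname{MSE}(cS^2)=\operatorname{Var}(cS^2)+\bigl(E[cS^2]-\sigma^2\bigr)^2=\sigma^4\Bigl[2(n-1)c^2+\bigl(c(n-1)-1\bigr)^2\Bigr].$$

Setting the derivative with respect to $c$ to zero gives $4(n-1)c+2(n-1)\bigl(c(n-1)-1\bigr)=0$, i.e. $c(n+1)=1$, so the MSE-minimizing choice within this class is $c=1/(n+1)$.
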
18

Maximum likelihood estimation (MLE) yields the most likely value of the model parameters, given the model and the data at hand -- which is a pretty attractive concept. Why would you choose parameter values that make the observed data less probable when you can choose the values that make the observed data more probable than any other parameter values would? Would you wish to sacrifice this feature for unbiasedness? I do not say the answer is always clear, but the motivation for MLE is pretty strong and intuitive.

Also, MLE may be more widely applicable than the method of moments, as far as I know. MLE seems more natural in cases involving latent variables; for example, a moving average (MA) model or a generalized autoregressive conditional heteroskedasticity (GARCH) model can be estimated directly by MLE (by directly I mean it is enough to specify a likelihood function and submit it to an optimization routine) -- but not by the method of moments (although indirect solutions utilizing the method of moments may exist).
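
To illustrate the "specify a likelihood function and submit it to an optimization routine" workflow, here is a minimal sketch in Python (an illustration added here, not part of the original answer; it uses SciPy and an i.i.d. Gaussian model purely for simplicity -- the MLE has a closed form in this case, but the same mechanics carry over to MA or GARCH likelihoods):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # toy data

def neg_log_lik(params, x):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + ((x - mu) / sigma) ** 2)

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]), args=(data,), method="BFGS")
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # sigma_hat**2 is the MLE of the variance, i.e. it divides by n, not n-1
```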

Richard Hardy
  • 54,375
  • 10
  • 95
  • 219
  • 4
    +1. Of course, there are plenty of cases when you don't want the most likely estimate, such as Gaussian Mixture Models (i.e. unbounded likelihood). In general, a great answer to help intuition of MLE's. – Cliff AB Nov 22 '15 at 21:03
  • 4
    (+1) But I think you need to add a definition of the "most likely" parameter value as that given which the data is most probable to be quite clear. Other intuitively desirable properties of an estimator unrelated to its long-term behaviour under repeated sampling might include its not depending on how you parametrize a model, & its not producing *impossible* estimates of the true parameter value. – Scortchi - Reinstate Monica Nov 23 '15 at 12:40
  • 6
    Think there's still a risk of "most likely"'s being read as "most probable". – Scortchi - Reinstate Monica Nov 23 '15 at 13:05
  • @Scortchi, I am not sure how I could improve it. "Most probable" is basically the same as "most likely" for me, once I think about likelihood / probability density. – Richard Hardy Nov 23 '15 at 13:11
  • 3
    @RichardHardy: They're not at all alike. [Most likely, the sun has gone out. Most probably, it hasn't.](https://xkcd.com/1132/) – user2357112 supports Monica Nov 23 '15 at 17:36
  • I'm not so sure we can say the maximum likelihood estimates are actually the most likely, at least not under any frequentist interpretation. It is after all the likelihood of the data that we're talking about. – dsaxton Nov 23 '15 at 18:29
  • 1
    @dsaxton: See [What is the difference between “likelihood” and “probability”?](http://stats.stackexchange.com/q/2641/17230). Likelihood is the probability of the data considered as a function of a parameter with the data given; so "maximum likelihood" refers to a maximization over parameter values, & we speak of the likelihood of a parameter value, *not* of the data. "Most likely", though quite often heard, hasn't quite reached the status of a standard statistical term; unless & until it does we should IMO take care to explain it. So the argument here can be paraphrased as ... – Scortchi - Reinstate Monica Nov 24 '15 at 09:49
  • ... "Why would you choose a value that makes the data observed less probable when you can choose the value that makes the data observed more probable than any other?". (For continuous random variables read "probability density" for "probability" throughout, or reflect that all our measurements have finite precision.) – Scortchi - Reinstate Monica Nov 24 '15 at 09:56
  • @Scortchi I know the difference between a density and a probability. The fact that the likelihood function can be viewed as a function of model parameters doesn't make it the likelihood "of" those parameters. It's the likelihood of the data as a function of model parameters. I don't think we can talk about the likelihood of the parameters themselves without putting priors on them. – dsaxton Nov 24 '15 at 14:08
  • 3
    @dsaxton: Statisticians have been differentiating the *likelihood* of a parameter value given the data from the *probability* of the data given a parameter value for nearly a century - see [Fisher (1921) "On the 'probable error of a correlation", *Metron*, **1**, pp 3-32](https://digital.library.adelaide.edu.au/dspace/bitstream/2440/15169/1/14.pdf) & [Pawitan (2013), *In All Likelihood: Statistical Modelling & Inference Using Likelihood*](https://books.google.co.uk/books?id=8T8fAQAAQBAJ) - so even though the terms are synonymous in ordinary usage it seems a bit late now to object. – Scortchi - Reinstate Monica Nov 24 '15 at 15:23
  • Oh, & I brought up densities as an afterthought, just in case any reader was wondering "Hmm ... what about continuous variables?" - not because I thought anyone in particular needed the difference pointing out. – Scortchi - Reinstate Monica Nov 24 '15 at 15:48
  • 1
    @Scortchi, I tried once more. I believe your comments helped me understand the matter better; I also hope I did not introduce new mistakes :) – Richard Hardy Nov 24 '15 at 17:42
12

Actually, scaling the maximum likelihood estimates in order to obtain unbiased estimates is a standard procedure in many estimation problems. The reason is that the MLE is a function of the sufficient statistics, so by the Rao-Blackwell and Lehmann-Scheffé theorems, if you can find an unbiased estimator based on a complete sufficient statistic, then you have the minimum variance unbiased estimator.

I know that your question is more general than that, but what I mean to emphasize is that these key concepts are intimately related to the likelihood and to estimates based on it. Those estimates might not be unbiased in finite samples, but asymptotically they are (under the usual regularity conditions), and moreover they are asymptotically efficient, i.e. they attain the Cramér-Rao lower bound on the variance of unbiased estimators -- which is not always the case for method-of-moments estimators.
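
To make the scaling concrete in the Gaussian case raised in the question (spelled out here for illustration): the MLE of the variance is $\hat\sigma^2_\text{MLE}=\frac{1}{n}\sum_i(x_i-\bar x)^2$; rescaling it by $\frac{n}{n-1}$ gives $s^2=\frac{1}{n-1}\sum_i(x_i-\bar x)^2$, which is unbiased and a function of the complete sufficient statistic, and hence, by Lehmann-Scheffé, the minimum variance unbiased estimator of $\sigma^2$.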

JohnK
  • 18,298
  • 10
  • 60
  • 103
11

To answer your question of why the MLE is so popular, consider that although it can be biased, it is consistent under standard conditions. In addition, it is asymptotically efficient, so at least for large samples, the MLE is likely to do as well as or better than any other estimator you might cook up. Finally, the MLE is found by a simple recipe: take the likelihood function and maximize it. In some cases that recipe may be hard to follow, but for most problems it is not. Plus, once you have the estimate, you can derive asymptotic standard errors right away using the Fisher information. Without the Fisher information, it is often really hard to derive error bounds.

This is why maximum likelihood is very often the go-to estimator (unless you're a Bayesian); it's simple to implement and likely to be just as good as, if not better than, anything else you would need to do more work to cook up.
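
As a concrete sketch of the "simple recipe plus Fisher information" point (an illustration added here, not part of the original answer; it uses an exponential model where everything is available in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam_true = 500, 2.0
x = rng.exponential(scale=1.0 / lam_true, size=n)  # toy data

# MLE of the exponential rate: maximize n*log(lam) - lam*sum(x)  =>  lam_hat = 1/mean(x)
lam_hat = 1.0 / x.mean()

# Fisher information I(lam) = n / lam^2, so the asymptotic standard error is lam_hat / sqrt(n)
se_hat = lam_hat / np.sqrt(n)
print(lam_hat, se_hat, (lam_hat - 1.96 * se_hat, lam_hat + 1.96 * se_hat))  # ~95% Wald interval
```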

Cliff AB
  • 17,741
  • 1
  • 39
  • 84
  • 1
    Can you please elaborate as to how it compares to the method of moments, as this seems to be an important part of the OP? – Antoni Parellada Nov 22 '15 at 17:41
  • 1
    As pointed out by whuber, the MOM estimators are also biased, so there's no "unbiasedness" advantage to the MOM estimators. Also, when the MOM and MLE estimators disagree, the MLE tends to have lower MSE. But this answer is really about why MLEs tend to be the default, rather than a direct comparison to other methods. – Cliff AB Nov 22 '15 at 18:19
  • 3
    @AntoniParellada There is an interesting thread in comparing MLE and MoM, http://stats.stackexchange.com/q/80380/28746 – Alecos Papadopoulos Nov 22 '15 at 18:54
3

I'd add that sometimes (often) we use an ML estimator because that's what we've got, even if in an ideal world it wouldn't be what we want. (I often think of statistics as being like engineering, where we use what we've got, not what we want.) In many cases it's easy to define the MLE and solve for it, then obtain a value using an iterative approach, whereas for a given parameter in a given situation there may be a better estimator (for some value of "better"); but finding it may require being very clever, and when you're done being clever, you still only have the better estimator for that one particular problem.

eac2222
  • 673
  • 6
  • 11
  • 1
    Out of curiosity, what's an example of what (in the ideal world) you would want? – Glen_b Nov 24 '15 at 23:11
  • 2
    @Glen_b: Dunno. Unbiased, lowest variance, easy to compute in closed form? When you first learn the estimators for least-squares regression, life seems simpler than it turns out to be. – eac2222 Nov 30 '15 at 14:06