What properties of the MLE make it so useful for picking parameters? Why would we want to maximize the likelihood? Why not maximize $P(\text{Parameter} \mid \text{Data})$ or anything else?
- Because you know the data and not the parameters, it is natural to ask which parameters we should pick so that this data is the most likely – Repmat Mar 12 '16 at 16:08
- Hm - that intuition doesn't make sense. There are so many other logical things we can optimize for - like 0-1 loss. Why is MLE so popular? – Mar 12 '16 at 16:13
- In terms of theory MLEs have some nice properties, like being functions of sufficient statistics and being efficient in terms of squared error. But aside from that, the idea has a philosophical appeal as well. Regardless, we don't *always* use MLEs, e.g. when estimating variances in regression models. – dsaxton Mar 12 '16 at 16:17
- Hmm - What is the philosophical appeal? – Mar 12 '16 at 16:27
- https://en.wikipedia.org/wiki/Occam%27s_razor – Vossler Mar 12 '16 at 16:35
- I don't think Occam's razor applies here. How is the MLE the simplest hypothesis? – Mar 12 '16 at 16:36
- Much of your question is pretty much covered by the Wikipedia article on maximum likelihood. Please check the help on [how to ask questions](http://stats.stackexchange.com/help/how-to-ask), which explains that you should both search for answers and research your question, so that you focus on asking things that aren't already well answered by a simple search. – Glen_b Mar 12 '16 at 17:13
3 Answers
Why would we want to maximize the likelihood?
Because - taking the form of the model as given - it's the set of parameter values that gives the sample we observed the best chance of occurring.
Why not maximize P(Parameter∣Data) or anything else?
In the framework where you would maximize likelihood, parameters are fixed but unknown; they don't have distributions.
For it even to make sense to calculate a distribution for the parameters, you have to be looking at them in a Bayesian way ... in which case you wouldn't be maximizing the likelihood on its own. However, the likelihood is still used:
$P(\text{parameter}|\text{data}) \propto P(\text{data}|\text{parameter}) \,P(\text{parameter})$
MAP (maximum a posteriori) estimation does what you're asking about here; it's similar to maximizing the likelihood, but it also incorporates the effect of the prior on the parameters.
As for "anything else" -- there are certainly cases where other estimators are used.
What properties of the MLE makes it so useful for picking parameters?
MLEs have many useful properties.
The big one (aside from consistency, I guess) would be that they're asymptotically efficient, which means that in sufficiently large samples you can't really beat them: no other well-behaved consistent estimator has a smaller asymptotic variance.
They have a number of other properties that broadly speaking make them "nice" to work with (such as functional invariance -- in two different senses), but efficiency is a big selling point.
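As a rough, non-authoritative illustration of that efficiency claim (the Laplace location model, sample size, and replication count below are my own choices, not part of the answer): for Laplace data the MLE of the location is the sample median, and in large samples its variance approaches the Cramér–Rao bound $b^2/n$, about half the variance $2b^2/n$ of the sample mean.

```python
# Sketch: asymptotic efficiency of the MLE, illustrated by simulation.
# For a Laplace location model the MLE is the sample median; its asymptotic
# variance 1/(n * I(mu)) = b^2 / n is half that of the sample mean (2 b^2 / n).
# The setup (Laplace(0, 1), n = 500, 20000 replications) is a toy assumption.
import numpy as np

rng = np.random.default_rng(0)
n, reps, b = 500, 20_000, 1.0

samples = rng.laplace(loc=0.0, scale=b, size=(reps, n))
medians = np.median(samples, axis=1)   # MLE of the location parameter
means = np.mean(samples, axis=1)       # a reasonable but inefficient competitor

print("var(median) * n ≈", np.var(medians) * n)   # ≈ 1.0  (Cramér–Rao bound b^2)
print("var(mean)   * n ≈", np.var(means) * n)     # ≈ 2.0  (2 b^2)
```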

- Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/36900/discussion-on-answer-by-glen-b-why-maximize-the-likelihood). [Mehrdad disagrees with an aspect of this answer; the discussion is in the linked chat.] – Glen_b Mar 13 '16 at 10:23
Glen_b gave the intuition, but we can get a little more technical if we want to. The technical reason is this: we maximize the likelihood because, asymptotically, the likelihood function $L(\theta)$ is maximized at the true value, say $\theta_0$.
A short proof of this result is not difficult; we will use two of the regularity conditions of the MLE, namely:
- The pdfs are distinct, i.e. the parameter $\theta$ identifies them.
- The pdfs have common support for all $\theta$.
What we are going to show is
$$\lim_{n \to \infty} P_{\theta_0} \left[ L\left(\theta_0, \mathbf{X} \right) > L\left (\theta,\mathbf{X} \right) \right] = 1, \ \text{for all} \ \theta \neq \theta_0$$
where the probability is taken under the true parameter $\theta_0$. Taking logs, dividing by $n$, and recalling the definition of $L\left(\theta, \mathbf{X}\right)$ as the joint density of the observed random sample, the inequality
$$L\left(\theta_0, \mathbf{X} \right) > L\left (\theta,\mathbf{X} \right)$$
is equivalent to
$$\frac{1}{n} \sum_{i=1}^n \log \left[ \frac{ f(X_i, \theta)}{f(X_i, \theta_0)} \right] <0 $$
Applying the Law of Large Numbers and Jensen's inequality for the strictly concave function $\log(x)$ (the inequality is strict because, by the first condition, the ratio $f(X, \theta)/f(X, \theta_0)$ is not constant with probability one), we have
$$\frac{1}{n} \sum_{i=1}^n \log \left[ \frac{ f(X_i, \theta)}{f(X_i, \theta_0)} \right] \xrightarrow{P} \mathbb{E}_{\theta_0} \left[ \log \frac{ f(X, \theta)}{f(X, \theta_0)} \right] < \log \mathbb{E}_{\theta_0} \left[ \frac{ f(X, \theta)}{f(X, \theta_0)} \right] $$
But the last term equals $\log(1) = 0$, since by the common-support condition
$$ \mathbb{E}_{\theta_0} \left[ \frac{ f(X, \theta)}{f(X, \theta_0)} \right] = \int_{-\infty}^{\infty} \frac{f(x;\theta)}{f(x;\theta_0)} f(x;\theta_0) \, dx = \int_{-\infty}^{\infty} f(x;\theta) \, dx = 1$$
Hence the result holds asymptotically, and we have good reason to try to maximize the likelihood: in large samples, the likelihood is (with probability tending to one) larger at the true parameter than at any other fixed value.
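If you want a quick numerical sanity check of the limit above, here is a small simulation sketch (the normal model and the particular values $\theta_0 = 0$, $\theta = 0.5$ are assumptions of mine): the average log-likelihood ratio settles at $\mathbb{E}_{\theta_0}\left[\log \frac{f(X,\theta)}{f(X,\theta_0)}\right]$, which is strictly negative for $\theta \neq \theta_0$.

```python
# Sketch checking the inequality from the proof by simulation (toy setup assumed):
# for N(theta_0, 1) data, the average log-likelihood ratio
#   (1/n) * sum_i log[ f(X_i; theta) / f(X_i; theta_0) ]
# should settle below 0 for any theta != theta_0 as n grows.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta0, theta = 0.0, 0.5           # true value vs a wrong candidate (assumed values)

for n in (10, 100, 1_000, 10_000):
    x = rng.normal(loc=theta0, scale=1.0, size=n)
    avg_llr = np.mean(norm.logpdf(x, loc=theta) - norm.logpdf(x, loc=theta0))
    print(f"n = {n:>6}:  average log-likelihood ratio = {avg_llr:+.4f}")
# For this unit-variance normal model the values concentrate around
# -(theta - theta0)^2 / 2 = -0.125.
```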

- This should be the accepted answer. The other answers don't answer the OP's question of "why maximize likelihood". – Miheer Sep 20 '19 at 18:07
I (and I guess many others) would agree that P(Parameter∣Data) is the most interesting inferential quantity, because it truly quantifies our knowledge about the parameter we are interested in.
However, as @Glen_b points out, calculating P(Parameter∣Data) requires specifying a prior p(Parameter), which introduces a certain level of subjectivity.
At the beginning of the 20th century, Fisher was looking for a way to get rid of the prior, and thus to get a unique answer to the question of the "best" parameter. He showed various useful properties of the MLE, i.e. the parameter that maximizes P(Data∣Parameter). See also my answer to this question.
If you are swayed by Fisher's argument, use the MLE. If you prefer to estimate P(Parameter∣Data), you are doing Bayesian inference, which requires specifying a prior.
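To make the contrast concrete, here is a minimal conjugate-prior sketch (the normal model with known variance, the prior N(0, 0.5²), and the sample size are all assumptions of mine, not from the answer): Fisher's route needs no prior and returns the sample mean, while the Bayesian route returns a full posterior P(Parameter∣Data) whose mean is shrunk toward the prior mean.

```python
# Sketch contrasting the two routes for a normal-mean problem (toy numbers assumed):
# the MLE needs no prior, while the Bayesian posterior P(mu | data) requires one.
# Conjugate setup: data ~ N(mu, sigma^2) with sigma known, prior mu ~ N(m0, s0^2).
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                      # known data standard deviation (assumption)
m0, s0 = 0.0, 0.5                # prior mean and standard deviation (assumption)
x = rng.normal(loc=1.0, scale=sigma, size=20)

mle = x.mean()                   # Fisher's route: maximize P(data | mu), no prior needed

# Bayesian route: standard conjugate update for a normal mean with known variance
post_var = 1.0 / (1.0 / s0**2 + len(x) / sigma**2)
post_mean = post_var * (m0 / s0**2 + x.sum() / sigma**2)

print(f"MLE:             {mle:.3f}")
print(f"Posterior mean:  {post_mean:.3f}  (pulled toward the prior mean {m0})")
print(f"Posterior sd:    {np.sqrt(post_var):.3f}")
```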
