How can I see that the maximum likelihood approach finds the parameter values of the probability distribution that maximize the probability of the observed sample? Maximum likelihood is not the maximum of probability in general because for, say, two continuous random variables $X_1, X_2$ we have $P(X_1=x_1,X_2=x_2)=0$.
-
As you say, you are maximizing $P(\text{data}|X_1, X_2)$. The likelihood function is not a probability distribution (doesn't integrate to 1) – stefgehrig Apr 08 '21 at 15:26
-
Yes, but how did people ever come up with "likelihood function" and why do we say that it's equivalent to maximizing the probability of observing the sample? How do we check this equivalence? – Alex Apr 08 '21 at 15:40
-
It's actually quite simple to check that by hand (for an example, at least). You write down the equation and algebraically find the maximum [example here: https://stats.stackexchange.com/questions/181035/how-to-derive-the-likelihood-function-for-binomial-distribution-for-parameter-es]. The term likelihood function is used to distinguish it from probability functions, but in essence the likelihood is a probability – stefgehrig Apr 08 '21 at 15:50
-
Could you algebraically show that "in essence"? Say, I define likelihood as some other weird function (e.g., sin of likelihood). How do you prove that this weird function is nonsense? How did people ever come up with Likelihood function? You can't just say: Here is the Likelihood function, let's maximize it and use the maximizers in whatever we want. There must be some algebra that results in the likelihood function. – Alex Apr 08 '21 at 15:56
-
Once you assume a distribution (see answer by rapaio), the likelihood function comes naturally. See the binomial distribution in the example I linked. Another example would be: you assume your response data is normal, so your likelihood function will be based on the Gaussian PDF. – stefgehrig Apr 08 '21 at 15:58
1 Answer
It does not find a probability distribution. It finds values for the parameters of a fixed distribution that is assumed a priori to be the true one. So, if the true distribution is the one you assumed, then it fits the parameter values of that distribution for which the data at hand achieve their maximum likelihood.
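For continuous variables, one common way to connect this to a probability is to look at a small neighbourhood of the observed sample instead of the sample point itself. The following is only a sketch under assumptions that are not stated in the question: the observations are i.i.d. with a continuous density $f(x;\theta)$, and the symbols $f$, $\theta$, $\varepsilon$ and $L$ are introduced here for illustration. For a small width $\varepsilon > 0$,

$$
P\big(X_1 \in [x_1, x_1+\varepsilon], \dots, X_n \in [x_n, x_n+\varepsilon];\ \theta\big) \;\approx\; \varepsilon^n \prod_{i=1}^{n} f(x_i;\theta) \;=\; \varepsilon^n L(\theta).
$$

The factor $\varepsilon^n$ is positive and does not depend on $\theta$, so the $\theta$ that maximizes $L(\theta)$ also maximizes this approximate probability of observing a sample "near" $(x_1, \dots, x_n)$. In that sense the point probability being zero is not an obstacle: the density plays the role of the probability up to a constant that does not affect the maximizer.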
In the expression "it maximizes the probability of the observed sample" there are two components. The first is the distribution family (the functional form of the distribution), which you assume when you estimate the parameters. There is no proof for that; you either work with it or you don't. The second is the fitted values of the distribution's parameters. Those parameter estimates, together with the functional form, make up the probability distribution under which the observed sample is most probable.
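To illustrate the two components, here is a minimal sketch for the discrete case, where the likelihood really is the probability of the observed sample. The binomial family is the assumed functional form (as in the example linked in the comments), and the data `k = 7` successes in `n = 20` trials, the grid resolution, and the function name `binomial_likelihood` are all made up for illustration.

```python
import math

def binomial_likelihood(p, k, n):
    """Probability of observing k successes in n trials if the success probability is p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical observed sample: 7 successes in 20 trials.
k, n = 7, 20

# Component 1: the family is fixed (binomial). Component 2: search over its parameter p.
grid = [i / 1000 for i in range(1001)]  # candidate values of p in [0, 1]
p_hat_grid = max(grid, key=lambda p: binomial_likelihood(p, k, n))

# Closed-form maximizer obtained by setting the derivative of the log-likelihood to zero.
p_hat_closed = k / n

print(p_hat_grid, p_hat_closed)
```

The grid search and the algebraic solution agree, which is the kind of by-hand check mentioned in the comments. Applying a non-monotone transformation such as `sin` to the likelihood would in general move the maximizer, so the result would no longer be the parameter value under which the observed sample is most probable.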
[Later] I think the confusion comes from the usage of the word "probability". Sometimes it means a probability distribution family and sometimes it means a probability distribution with known parameter values. I think the latter meaning is more appropriate, keeping a separate term for the distribution family, but I confess I don't always keep to that rigor, probably out of bad habit.
