
This is kind of an odd thought I had while reviewing some old statistics, and for some reason I can't seem to work out the answer.

A continuous PDF does not give probabilities of individual values; rather, the probability of observing a value in any given range is the integral of the density over that range. For example, if $X \sim N(\mu,\sigma^2)$, then the probability that a realization falls between $a$ and $b$ is simply $\int_a^{b} f(x)\,dx$, where $f$ is the density of $N(\mu,\sigma^2)$.
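To make that concrete, here is a quick numerical check (a minimal sketch assuming SciPy is available; the values of $\mu$, $\sigma$, $a$, and $b$ are arbitrary illustrative choices):

```python
from scipy.stats import norm
from scipy.integrate import quad

mu, sigma = 0.0, 1.0            # illustrative choice: the standard normal
a, b = -1.0, 2.0                # arbitrary interval endpoints

# P(a < X < b) via the CDF ...
p_cdf = norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma)

# ... and via direct numerical integration of the density
p_int, _ = quad(lambda x: norm.pdf(x, mu, sigma), a, b)

print(p_cdf, p_int)             # both are about 0.8186
```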

When we think about computing the maximum likelihood estimate of a parameter, say of $\mu$, we write down the joint density of, say, $N$ random variables $X_1, \ldots, X_N$, differentiate the log-likelihood with respect to $\mu$, set it equal to 0, and solve for $\mu$. The interpretation often given is: "given the data, which parameter value makes this density most plausible?"
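As a concrete illustration of that recipe (a minimal sketch assuming a normal sample with known $\sigma$ and SciPy's optimiser; the simulated data and all numbers are illustrative), the first-order condition gives $\hat{\mu} = \bar{x}$, and a numerical maximiser of the log-likelihood recovers the same value:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated sample; true mu = 2

sigma = 1.5                                    # treat sigma as known for simplicity

def neg_log_lik(mu):
    # negative joint log-density of the sample, viewed as a function of mu
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

mu_hat = minimize_scalar(neg_log_lik, bounds=(-10, 10), method="bounded").x
print(mu_hat, x.mean())                        # the numerical maximiser matches the sample mean
```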

The part that is bugging me is this: we have a joint density of $N$ random variables, and the probability of getting any particular realization, such as our sample, is exactly 0. Why does it even make sense to maximize the joint density given our data (since, again, the probability of observing our actual sample is exactly 0)?

The only rationalization I could come up with is that we want to make the PDF as peaked as possible around our observed sample, so that the integral over a small region around it (and therefore the probability of observing something in that region) is as high as possible.
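That intuition can be checked directly (a minimal sketch with illustrative numbers, assuming SciPy): for a small window of width $\delta$ around the observed value, the probability of landing in the window is approximately the density times $\delta$, so the parameter value that maximises the density also maximises the probability of the window, whatever the (small) $\delta$.

```python
from scipy.stats import norm

x_obs = 1.3                        # a single observed value (illustrative)
delta = 1e-3                       # width of a small window around the observation

for mu in (0.0, 1.0, 1.3, 2.0):
    dens = norm.pdf(x_obs, loc=mu)                                           # density at the observation
    win = norm.cdf(x_obs + delta / 2, mu) - norm.cdf(x_obs - delta / 2, mu)  # window probability
    print(mu, dens, win / delta)   # window probability / delta tracks the density, peaking at mu = x_obs
```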

Alex
  • For the same reason we use probability densities https://stats.stackexchange.com/q/4220/35989 – Tim Jan 06 '19 at 16:58
  • I understand (I think) why it makes sense to use densities. What I don't understand is why it makes sense to maximize a density conditional on observing a sample that has 0 probability of occurring. – Alex Jan 06 '19 at 17:12
  • Because probability densities tell us what values are relatively more likely than others. – Tim Jan 06 '19 at 17:24
  • If you have the time to answer the question fully, I think that would be more helpful for me and the next person. – Alex Jan 06 '19 at 17:35
  • Because, fortunately, the likelihood is not a probability! – AdamO Jan 07 '19 at 18:02

1 Answer


The probability of any particular sample, $\mathbb{P}_\theta(X=x)$, is equal to zero and yet one sample is realised by drawing from a probability distribution. Probability is therefore the wrong tool for evaluating a sample and the likelihood it occurs. The statistical likelihood, as defined by Fisher (1912), is instead based on a limiting argument: the probability of observing the sample $x$ within an interval of length $\delta$, renormalised by $\delta$, as $\delta$ goes to zero (see the passage quoted in Aldrich, J. (1997), Statistical Science, 12, 162-176). The term likelihood function is only introduced in Fisher (1921) and that of maximum likelihood in Fisher (1922).
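To see the limiting argument numerically, here is a small sketch (simulated data; the sample size and the value of $\mu$ are arbitrary illustrative choices): the probability of the whole sample falling coordinate-wise within windows of length $\delta$, renormalised by $\delta$ for each observation, approaches the product of the densities, i.e. the likelihood, as $\delta \to 0$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.5, size=5)          # a small iid N(0.5, 1) sample (illustrative)
mu = 0.5                                 # parameter value at which to evaluate

def renormalised_prob(mu, delta):
    # P(each x_i lies in a window of length delta around its observed value),
    # renormalised by delta for every observation
    p = norm.cdf(x + delta / 2, mu) - norm.cdf(x - delta / 2, mu)
    return np.prod(p / delta)

likelihood = np.prod(norm.pdf(x, mu))    # the usual likelihood: product of densities
for delta in (1e-1, 1e-3, 1e-5):
    print(delta, renormalised_prob(mu, delta), likelihood)   # the two agree as delta shrinks
```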

Although he called it the "most probable value" and used a principle of inverse probability (Bayesian inference) with a flat prior, Carl Friedrich Gauß had already derived a maximum likelihood estimator for the variance parameter of a Normal distribution in 1809. Hald (1999) mentions several other occurrences of maximum likelihood estimators before Fisher's 1912 paper, which set out the general principle.

A later justification of the maximum likelihood approach is that, since the renormalised log-likelihood of a sample $(x_1,\ldots,x_n)$ $$\frac{1}{n} \sum_{i=1}^n \log f_\theta(x_i)$$ converges, by the Law of Large Numbers, to $$\mathbb{E}[\log f_\theta(X)]=\int \log f_\theta(x)\,f_0(x)\,\text{d}x$$ (where $f_0$ denotes the true density of the iid sample), maximising the likelihood as a function of $\theta$ is asymptotically equivalent to minimising in $\theta$ the Kullback-Leibler divergence $$\int \log \dfrac{f_0(x)}{f_\theta(x)}\, f_0(x)\,\text{d}x=\underbrace{\int \log f_0(x)\,f_0(x)\,\text{d}x}_{\text{constant}\\\text{in }\theta}-\int \log f_\theta(x)\,f_0(x)\,\text{d}x$$ between the true distribution of the iid sample and the family of distributions represented by the $f_\theta$'s.
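A small simulation may make the last equivalence more tangible (a sketch only; the true density, sample size, and grid of candidate values are arbitrary illustrative choices, and the KL divergence is evaluated by numerical integration): the average log-likelihood over a large sample is maximised at essentially the same $\theta$ that minimises the Kullback-Leibler divergence from the true distribution.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(2)
mu0 = 1.0                                  # true mean: f_0 is the N(1, 1) density
x = rng.normal(loc=mu0, size=100_000)      # large iid sample from f_0

mus = np.linspace(-1.0, 3.0, 81)           # candidate values of theta = mu

# renormalised log-likelihood (1/n) * sum_i log f_theta(x_i)
avg_loglik = np.array([np.mean(norm.logpdf(x, mu)) for mu in mus])

# KL(f_0 || f_theta), computed by numerical integration over a wide range
def kl(mu):
    integrand = lambda t: norm.pdf(t, mu0) * (norm.logpdf(t, mu0) - norm.logpdf(t, mu))
    return quad(integrand, -10, 12)[0]

kls = np.array([kl(mu) for mu in mus])

print(mus[np.argmax(avg_loglik)])   # maximiser of the average log-likelihood ...
print(mus[np.argmin(kls)])          # ... coincides with the minimiser of the KL divergence
```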

Xi'an
  • Thanks for the answer. Could you expand a bit on the KL argument? I'm not seeing how this is the case immediately. – Alex Jan 07 '19 at 16:24
  • “Probability is therefore the wrong tool for evaluating a sample and the likelihood it occurs.” Now everything makes sense to me. Thanks for the great answer. – Aroonalok Jul 21 '21 at 03:52