
In maximum likelihood estimation, we maximise the likelihood.

I don't understand how this is possible: for any reasonable dataset, the likelihood of hitting that EXACT data set is obviously zero! So how can we ever hope to maximize it?

For example, take data from a bunch of coin flips. The coin is fair, and you want to evaluate the likelihood at $p = 0.5$. This is obviously the optimal value, since the coin is fair.

Yet if you have BILLIONS of data points, your likelihood will be $\prod_{i=1}^{\text{BILLIONS}} p^{x_i}(1-p)^{1 - x_i}$.

It doesn't matter what your data look like: that product will ALWAYS be zero. There's no way around it. You're multiplying BILLIONS of small numbers, so in floating point the value underflows to zero.

Numerically, then (which is how most likelihood methods are implemented), how would one possibly maximize a value that is identically zero?
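To see the numerical problem concretely, here is a throwaway sketch (Python with NumPy; the simulated sample is just for illustration):

```python
import numpy as np

# Simulate a large sample of fair-coin flips (a million here for speed).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1_000_000)

p = 0.5
# Each factor is p or 1 - p, i.e. 0.5. The product of a million such
# factors is 2**(-1_000_000), far below the smallest positive double,
# so in floating point it underflows to exactly 0.0.
likelihood = np.prod(np.where(x == 1, p, 1 - p))
print(likelihood)  # 0.0
```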

2 Answers


You seem to be concerned with the numerical aspect of likelihood maximization. We often work with the logarithm of the likelihood, which is more convenient numerically. The log-likelihood is $\sum_{i=1}^\text{billions}[x_i\ln(p)+(1-x_i)\ln(1-p)]$. For $p=0.5$, $\ln(p)\approx-0.69$, so there should be no small-number problem. Nor should there be a large-number problem as long as the sample size is merely on the order of billions.
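To make this concrete, here is a minimal sketch (assuming NumPy; the simulated sample and the function name are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1_000_000)  # simulated Bernoulli(0.5) data

def log_likelihood(p):
    # sum_i [x_i ln(p) + (1 - x_i) ln(1 - p)]
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Moderate negative numbers, not zeros; the maximum is attained near
# the sample proportion of ones (about 0.5 here).
for p in (0.3, 0.5, 0.7):
    print(p, log_likelihood(p))
```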

Regarding

for any reasonable dataset, the likelihood of hitting that EXACT data set is obviously zero

you may note that likelihood is numerically the same as probability mass (for discrete distributions) or probability density (for continuous distributions), even though, when seen as functions, these have different arguments (parameters for likelihood vs. data for the PMF or PDF). These are usually not zero for data that are possible under the model. E.g. if $X\sim N(\mu,\sigma^2)$, then $p_X(x)\neq 0$ for all $x\in\mathbb{R}$ even though $P(X=x)=0$ for any $x\in\mathbb{R}$.
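A quick numerical check (a sketch assuming SciPy's `norm` is available):

```python
from scipy.stats import norm

# The density (the likelihood contribution) at a single point is positive...
print(norm.pdf(0.0))  # ~0.3989, the standard normal density at 0
# ...even though the probability of hitting any exact point is zero:
print(norm.cdf(0.0) - norm.cdf(0.0))  # 0.0, i.e. P(X = 0) = 0
```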

Richard Hardy

Your question is really quite deep and is rooted in calculus (measure theory); see a similar one here: https://stats.stackexchange.com/a/273407/36041

In the case of MLE, a simple analogy is the density of an object. Imagine you have a chunk of sausage and I ask you to tell me where it is thickest. I point you to a segment, but you object that if you keep slicing the sausage thinner and thinner, until each slice is thinner than a molecule, at some point the weight of a slice becomes essentially zero. So you'd ask, "How come the thickest piece of the sausage weighs zero?"

So your question is very similar: MLE finds the thickest place on the sausage.
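In code, "finding the thickest place" is just maximizing the log-likelihood. A minimal sketch (assuming NumPy and SciPy; the simulated coin-flip data are mine):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)  # simulated fair-coin flips

def neg_log_likelihood(p):
    # Negative Bernoulli log-likelihood (minimizing it maximizes the likelihood).
    return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())  # the MLE coincides with the sample proportion
```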

Aksakal