
In MLE for the continuous case, my course notes define the likelihood function as:

$$ L(\theta) = L(\theta;y) = \prod_{i=1}^n f(y_i;\theta) $$

where $f$ is the pdf of each $y_i$ given $\theta$ (so the product is the joint pdf of the sample).

I thought it normally doesn't make sense to evaluate a pdf at a point (e.g., $f(X = 2) = 0$ for some continuous rv $X$), and that it only makes sense to ask for the probability that the random variable is $>$ or $<$ some value.

But in the case of MLE, why is it OK to evaluate a joint pdf at a point?

foobar
  • Your premise is flawed. It doesn't make sense to treat density as probability, but that doesn't mean it's not okay to evaluate a density if you treat it as *density*. – Glen_b Oct 30 '16 at 23:57
  • First, great question! In my view, the term "likelihood" has not been very rigorously defined in the statistics literature. One detailed, if somewhat technical, answer to your question can be found in Sec. 6.2 of _The Statistical Analysis of Failure Time Data_ by Kalbfleisch and Prentice (with an emphasis on survival analysis models). For classical one-sample statistical models, roughly speaking, you can still start with the probability argument (which is natural) and then pass to a limit to obtain the density expression. I will come back to give a formal answer if I get time. – Zhanxiong Oct 31 '16 at 04:34
  • Please do. I roughly have an idea what the other answers are about, but I don't know why in this case we're interpreting it as a density and not a probability... – foobar Nov 01 '16 at 04:33

2 Answers


Intuitively, @dsaxton's answer provides the correct logic. Let me just "translate" it into mathematical language.

Suppose the sample $Y_1, \ldots, Y_n \text{ i.i.d.} \sim f_\theta(y)$, $\theta \in \Theta$, where $\Theta$ is the parameter space and $f_\theta(\cdot)$ are density functions. After a vector of observations $y = (y_1, \ldots, y_n)'$ has been made, the maximum likelihood principle aims to look for an estimator $\hat{\theta} \in \Theta$ such that for any $\delta > 0$, the probability \begin{equation} P_{\hat{\theta}}[(Y_1, \ldots, Y_n) \in (y_1 - \delta, y_1 + \delta) \times \cdots \times (y_n - \delta, y_n + \delta)] \tag{1} \end{equation} is the maximum over $\theta \in \Theta$, which suggests considering the quantity \begin{equation} L(\theta, \delta) \equiv P_{\theta}[(Y_1, \ldots, Y_n) \in (y_1 - \delta, y_1 + \delta) \times \cdots \times (y_n - \delta, y_n + \delta)], \quad \theta \in \Theta.\tag{2} \end{equation}

Note that from the statistical inference point of view, the probability in $(2)$ should be viewed as a function of $\theta$.

We now simplify $(2)$ by invoking the i.i.d. assumption and that $P_\theta$ has density $f_\theta$. Clearly,

$$L(\theta, \delta) = \prod_{i = 1}^n P_\theta[y_i - \delta < Y_i < y_i + \delta] = \prod_{i = 1}^n \int_{y_i - \delta}^{y_i + \delta}f_\theta(y) dy. \tag{3}$$
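To make $(3)$ concrete, here is a minimal numerical sketch, assuming a hypothetical $N(\theta, 1)$ model and made-up observations (neither comes from the question); `scipy.stats.norm` supplies the CDF for the interval probabilities.

```python
# Sketch of (3): L(theta, delta) as a product of interval probabilities.
# The N(theta, 1) model and the observations below are purely hypothetical.
from scipy.stats import norm

y = [1.2, 0.7, 2.1, 1.5]   # hypothetical observed sample
delta = 0.01

def L_delta(theta, delta):
    """Product over i of P_theta(y_i - delta < Y_i < y_i + delta)."""
    prob = 1.0
    for yi in y:
        prob *= norm.cdf(yi + delta, loc=theta) - norm.cdf(yi - delta, loc=theta)
    return prob

# A theta near the data gives a larger neighborhood probability than a distant one.
print(L_delta(theta=1.4, delta=delta))
print(L_delta(theta=3.0, delta=delta))
```

A $\theta$ close to the data makes the observed neighborhood much more probable than a distant $\theta$, which is exactly what the maximum likelihood principle exploits.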

What we need to show, and what thus answers your question, is: if $\hat{\theta}$ maximizes $L(\theta, \delta)$ with respect to $\theta \in \Theta$ for any $\delta > 0$, then it also maximizes the so-called likelihood function $$L(\theta) = \prod_{i = 1}^n f_{\theta}(y_i). \tag{4}$$

So let's start by assuming that for any $\delta > 0$, $$L(\hat{\theta}, \delta) \geq L(\theta, \delta), \quad \forall \theta \in \Theta. \tag{5}$$

Dividing both sides of $(5)$ by $(2\delta)^n$ and letting $\delta \downarrow 0$ gives $$\prod_{i = 1}^n f_{\hat{\theta}}(y_i) \geq \prod_{i = 1}^n f_{\theta}(y_i), \quad \forall \theta \in \Theta,$$ which is precisely $L(\hat{\theta}) \geq L(\theta), \forall \theta \in \Theta$.
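Spelled out per factor, the limit step is (assuming each $f_\theta$ is continuous at $y_i$; more generally it holds for almost every $y_i$ by the Lebesgue differentiation theorem)

$$\frac{1}{2\delta}\int_{y_i - \delta}^{y_i + \delta} f_\theta(y)\, dy \;\to\; f_\theta(y_i) \quad \text{as } \delta \downarrow 0,$$

so that

$$\frac{L(\theta, \delta)}{(2\delta)^n} = \prod_{i = 1}^n \frac{1}{2\delta}\int_{y_i - \delta}^{y_i + \delta} f_\theta(y)\, dy \;\to\; \prod_{i = 1}^n f_\theta(y_i) = L(\theta).$$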

In other words, maximizing the so-called likelihood function $(4)$ (which is a product of densities) is a necessary condition for carrying out the maximum likelihood principle. From this point of view, the product-of-densities form makes sense.
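As a rough numerical sanity check of this conclusion (again under a hypothetical $N(\theta, 1)$ model with made-up data, not anything from the question): for a small $\delta$, the $\theta$ maximizing the interval-probability product $L(\theta, \delta)$ from $(3)$ agrees with the $\theta$ maximizing the density product $(4)$.

```python
# Compare the maximizer of L(theta, delta) (interval probabilities, small delta)
# with the maximizer of L(theta) (product of densities) on a theta grid.
# Model and data are hypothetical: Y_i ~ N(theta, 1), observations made up.
import numpy as np
from scipy.stats import norm

y = np.array([1.2, 0.7, 2.1, 1.5])      # hypothetical observations
thetas = np.linspace(-1.0, 4.0, 1001)   # grid over the parameter space
delta = 1e-4

# L(theta, delta): product over i of P_theta(y_i - delta < Y_i < y_i + delta)
L_interval = np.array([
    np.prod(norm.cdf(y + delta, loc=t) - norm.cdf(y - delta, loc=t))
    for t in thetas
])

# L(theta): product over i of f_theta(y_i)
L_density = np.array([np.prod(norm.pdf(y, loc=t)) for t in thetas])

# Both argmaxes land on (essentially) the sample mean 1.375.
print(thetas[np.argmax(L_interval)], thetas[np.argmax(L_density)])
```

The agreement is expected: as argued above, dividing $L(\theta, \delta)$ by the constant $(2\delta)^n$ does not change the maximizer, and for small $\delta$ that ratio is essentially $L(\theta)$.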

The above is just my own interpretation; any comment or critique is very welcome.

Zhanxiong
  • Not for any $\delta > 0$. Imagine a distribution with an arbitrarily narrow, high peak with almost zero density around it (think of a pulse), and then, some distance away, a plateau of considerable width and slightly lower height; now apply your explanation with $\delta$ equal to the width of the plateau. Your explanation would give the plateau's middle as the ML estimate. In your explanation, letting $\delta \downarrow 0$ is wrong when you have assumed it to be anything $> 0$. – MiloMinderbinder May 12 '18 at 03:06
  • To point it out specifically: the maximum likelihood principle aims to look for an estimator $\hat{\theta} \in \Theta$ such that for _arbitrarily small_ $\delta > 0$, the probability \begin{equation} P_{\hat{\theta}}[(Y_1, \ldots, Y_n) \in (y_1 - \delta, y_1 + \delta) \times \cdots \times (y_n - \delta, y_n + \delta)] \tag{1} \end{equation} is the maximum over $\theta \in \Theta$. – MiloMinderbinder May 12 '18 at 03:09

It's not "incorrect" to look at a density function at a specific point. If it were, what's the point of the function?

What you may have heard is that a density function at a given point is not to be interpreted as a probability, but that doesn't make it unimportant. The density function tells you how "tightly packed" probability is around a certain point, or how likely a random variable is to be close to the given value, and this idea extends to random samples as well.

In the case of maximum likelihood estimation, what we're doing is finding a value for the parameter that causes the sample to belong to a "neighborhood" of high probability, relative to other regions of the sample space.
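A quick numerical illustration of that "tightly packed" reading, using a standard normal purely as an arbitrary example (the point and width below are made up): the probability of a small neighborhood around a point, rescaled by the neighborhood's width, is essentially the density at that point.

```python
# P(y - delta < Y < y + delta) / (2 * delta) ~= f(y) for small delta.
# Standard normal chosen as an arbitrary example distribution.
from scipy.stats import norm

y, delta = 1.0, 1e-4
neighborhood_prob = norm.cdf(y + delta) - norm.cdf(y - delta)

print(neighborhood_prob / (2 * delta))  # ~= 0.2420
print(norm.pdf(y))                      # 0.2420, the density at y = 1
```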

dsaxton
  • Well, it does indeed make sense to say that densities are not meant to be evaluated; they are really meant to be *integrated*. A book taking that point of view is https://www.amazon.com/Data-Analysis-Approximate-Models-Location-Scale/dp/1482215861 – kjetil b halvorsen Feb 19 '17 at 18:05