85

What is the reason that a likelihood function is not a pdf (probability density function)?

Macro
John Doe
  • 15
    The _likelihood function_ is a function of the unknown parameter $\theta$ (conditioned on the data). As such, it typically does not have area 1 (i.e. the integral over all possible values of $\theta$ is not 1) and is therefore by definition not a pdf. – MånsT Jun 27 '12 at 18:35
  • 6
    The same question on MO 2 years ago: http://mathoverflow.net/questions/10971/why-isnt-likelihood-a-probability-density-function – Douglas Zare Jun 27 '12 at 20:35
  • 6
    Interesting reference, @Douglas. The answers are rather unsatisfactory, IMHO. The accepted one assumes things that just aren't true ("both $p(X|m)$ and $p(m|X)$ are pdfs": *not*!) and the others don't really get at the statistical issues. – whuber Jun 27 '12 at 22:18
  • 4
    +1 whuber. It is amazing that there are such bad answers on the mathoverflow site, in spite of its high mathematical level! – Stéphane Laurent Jun 28 '12 at 18:28
  • 3
    @Stephane: This is true, but statisticians and even probabilists seem to be fairly few and far between on MO, with some notable exceptions. That question was from fairly early in MO's existence when both the generally admissible questions and quality of answers were substantially different. – cardinal Jun 28 '12 at 20:30
  • 1
    I *am* very happy to see @Douglas wander over here recently. I'm looking forward to his continued participation as I feel he is and will be a real asset to the site. – cardinal Jun 28 '12 at 20:32

5 Answers

85

We'll start with two definitions:

  • A probability density function (pdf) is a non-negative function that integrates to $1$.

  • The likelihood is defined as the joint density of the observed data as a function of the parameter. But, as pointed out by the reference to Lehmann made by @whuber in a comment below, the likelihood function is a function of the parameter only, with the data held as a fixed constant. So the fact that it is a density as a function of the data is irrelevant.

Therefore, the likelihood function is not a pdf because its integral with respect to the parameter does not necessarily equal 1 (and may not be integrable at all, actually, as pointed out by another comment from @whuber).

To see this, we'll use a simple example. Suppose you have a single observation, $x$, from a ${\rm Bernoulli}(\theta)$ distribution. Then the likelihood function is

$$ L(\theta) = \theta^{x} (1 - \theta)^{1-x} $$

It is a fact that $\int_{0}^{1} L(\theta) d \theta = 1/2$. Specifically, if $x = 1$, then $L(\theta) = \theta$, so $$\int_{0}^{1} L(\theta) d \theta = \int_{0}^{1} \theta \ d \theta = 1/2$$

and a similar calculation applies when $x = 0$. Therefore, $L(\theta)$ cannot be a density function.
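
As a quick numerical sanity check of this example (a sketch added for illustration, not part of the original answer), one can integrate $L(\theta)$ over $[0,1]$ for each value of $x$ and see that the area is $1/2$, not $1$:

```python
# Sketch: numerically confirm that the Bernoulli likelihood L(theta) = theta^x (1 - theta)^(1 - x)
# integrates to 1/2 over [0, 1] for x = 0 and x = 1, so it cannot be a density in theta.
from scipy.integrate import quad

def bernoulli_likelihood(theta, x):
    return theta**x * (1.0 - theta)**(1 - x)

for x in (0, 1):
    area, _ = quad(bernoulli_likelihood, 0.0, 1.0, args=(x,))
    print(f"x = {x}: integral of L(theta) over [0, 1] = {area:.3f}")  # 0.500 in both cases
```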

Perhaps even more important than this technical example showing why the likelihood isn't a probability density is to point out that the likelihood is not the probability of the parameter value being correct or anything like that - it is the probability (density) of the data given the parameter value, which is a completely different thing. Therefore one should not expect the likelihood function to behave like a probability density.

Macro
  • 22
    +1 A subtle point is that even the appearance of the "$d\theta$" in the integral is *not* part of the likelihood function; it comes from nowhere. Among the many ways to see this, consider that a reparameterization changes nothing essential about the likelihood--it is merely a renaming of the parameter--but will change the integral. E.g., if we parameterized the Bernoulli distributions with the log odds $\psi=\log(\theta/(1-\theta))$ then the integral would not even converge. – whuber Jun 27 '12 at 21:03
  • @whuber - I know that MLEs are invariant to monotone transformation but clearly the shape of the likelihood as a function of $\psi$ is different from the shape as a function of $\theta$, so the integrals would be different. Isn't this akin to how the normalizing constant can be different when you transform a random variable? Perhaps I've missed your point.. – Macro Jun 28 '12 at 00:48
  • 5
    That's one way to put it: MLEs are invariant under monotone transformations *but probability densities are not,* QED! This was exactly Fisher's argument, which I have sketched in a comment to @Michael Chernick's reply. – whuber Jun 28 '12 at 00:53
  • A quibble - is that the right definition of PDF? Suppose we have $f(x) = 1$ if $x$ is between 0 and 1 and irrational, and $f(x) = 0$ otherwise. My probability/measure theory is too rusty to remember if you need continuity or something stronger/weaker, but I can see my $f$ giving some very anti-intuitive results. – Patrick Caldon Jun 28 '12 at 01:56
  • @PatrickCaldon - I think any non-negative integrable function is proportional to a probability density. You surely do not need continuity - consider the density of a variable defined by a mixture distribution e.g. a 50/50 mixture between a ${\rm uniform}(0,1)$ and ${\rm uniform}(2,3)$ has a discontinuous density. I guess I don't see the problem with the example you've given. If you can calculate $ \int_{D} f$ then you can assign a probability to the set $D$. – Macro Jun 28 '12 at 02:10
  • @Macro - I see your point, and I'm pretty sure I'm wrong. – Patrick Caldon Jun 28 '12 at 02:35
  • 10
    +1 for whuber's comment. The "$d\theta$" does not even make sense in general, because there is not even a $\sigma$-field on the parameter space! – Stéphane Laurent Jun 28 '12 at 18:15
  • Agreed @StéphaneLaurent. I was just making it plainly clear that the likelihood is not a density by showing that a necessary condition for a function to be a density is not satisfied. – Macro Jun 28 '12 at 18:30
  • 1
    @PatrickCaldon The only continuity constraint is on the cdf, which requires right-continuity. You need this so your probability doesn't go from defined to undefined and (possibly) back again, which would be weird. I'm not 100% sure but I think so long as you have your cdf, and so a probability, you don't even have to be able to solve $\int_D f$. If you can that just ensures that the RV is continuous. – Joey Jun 29 '12 at 19:58
  • @whuber I'm confused. I thought the likelihood is the product of the probability of all points in the sample space. What am I missing here? – hbak Feb 07 '17 at 15:01
  • 1
    @hbak It's the joint density of the sample data. It only becomes a product if we further assume that the samples are independent of one another. – SigmaX Dec 05 '18 at 15:16
  • """ Perhaps even more important than this technical example showing why the likelihood isn't a probability density is to point out that the likelihood is not the probability of the parameter value being correct or anything like that - it is the probability (density) of the data given the parameter value, which is a completely different thing""" Wait, that paragraph says a likelihood isn't a probability density because it's a probability density. – Hillary Sanders Jun 29 '21 at 21:58
4

Okay, but the likelihood function is the joint probability density for the observed data given the parameter $\theta$. As such, it can be normalized to form a probability density function. So it is essentially like a pdf.
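
For what it's worth, here is a minimal sketch (added for illustration; it is not part of this answer) of the normalization being described, reusing the Bernoulli example from the accepted answer. Dividing $L(\theta)$ by its integral does yield a proper density in $\theta$, but the normalizing constant is imposed from outside and is not part of the likelihood itself:

```python
# Sketch: normalize the Bernoulli likelihood so it integrates to 1 in theta.
# The divisor (the area under L) is something we supply; the raw likelihood has area 1/2.
from scipy.integrate import quad

x = 1  # the single observed Bernoulli outcome
likelihood = lambda theta: theta**x * (1.0 - theta)**(1 - x)

area, _ = quad(likelihood, 0.0, 1.0)                 # 0.5
normalized = lambda theta: likelihood(theta) / area  # now a density in theta (Beta(2, 1) when x = 1)

check, _ = quad(normalized, 0.0, 1.0)
print(area, check)  # 0.5 and 1.0
```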

Michael R. Chernick
  • 3
    So, you're just pointing out that the likelihood is integrable with respect to the parameter (is that always true?). I suppose you may be alluding to the likelihood's relationship to the posterior distribution when a flat prior is used, but without more explanation this answer remains mysterious to me. – Macro Jun 27 '12 at 21:45
  • I was just following up on your statement "it is the probability of the data given the parameter value." It is a joint probability density, and everyone is saying that it is not a pdf because it doesn't integrate to 1. But I think of it the way you stated it, which makes it essentially a pdf, only needing normalization. Similarly, likelihood x prior is a posterior density up to normalization. – Michael R. Chernick Jun 27 '12 at 22:27
  • 7
    Integrating to unity is beside the point. Fisher, in a 1922 paper *On the Mathematical Foundations of Theoretical Statistics,* observed that indeed usually the likelihood $L(\theta)$ can be "normalized" to integrate to unity upon multiplying by a suitable function $p(\theta)$ so that $\int L(\theta)p(\theta)d\theta=1$. What he objected to is the *arbitrariness*: there are many $p$ that work. "...the word probability is wrongly used in such a connection: probability is a ratio of frequencies, and about the frequencies of such values we can know nothing whatever." – whuber Jun 28 '12 at 00:51
  • But @whuber, I think that what Michael is trying to say is that the integration/sum of the likelihood (with respect to the data) to 1 **can always be done**. Recall that, in fact, $p(x|\theta)=L(\theta)$ (as you stated in one of your comments). For the Bernoulli example, this is the elementary result of summing the binomial distribution with $n=1$. – Néstor Jun 28 '12 at 01:31
  • (A point to which I also agree). – Néstor Jun 28 '12 at 01:32
  • @whuber Sorry that bothers you so much that I think of the likelihood function as being a density function. I am not talking about a normalizing function P(θ) as in your quote from Fisher. As Nestor says I am thinking of likelihood as a joint probability density for the observations given the parameter(s). – Michael R. Chernick Jun 28 '12 at 01:45
  • 1
    @Néstor (and Michael) - it appears that whuber and I both interpreted this question as asking why the likelihood is not a density function, **as a function of $\theta$** so it appears we are answering different questions. Of course the likelihood is the density function of the observations (given the parameter value) - that is how it's defined. – Macro Jun 28 '12 at 01:55
  • @Macro I see. The OP wrote "Why is the likelihood function not a pdf?" Kind of ambiguous. Why did you, MansT, and whuber interpret that to mean a density with respect to theta? I would never think of it as a density with respect to theta, and in fact, as whuber points out, in some parameterizations the likelihood may not even be integrable with respect to the parameter. – Michael R. Chernick Jun 28 '12 at 02:08
  • 2
    Michael, I think we interpreted it that way because the likelihood is a function of $\theta$ so, if it were a density, then it would be a density in $\theta$. I can imagine interpreting it the way you have but that possibility didn't occur to me until after reading Nestor's comment. – Macro Jun 28 '12 at 02:14
  • 4
    I find the ambiguity is created by this answer but is not present in the question. As @Macro points out, the likelihood is a function *only* of the parameter. (*E.g.*, "The density $f(x_1,\theta)\cdots f(x_n,\theta)$, considered for fixed $x$ as a function of $\theta$, is called the *likelihood function*: E. L. Lehmann, *Theory of Point Estimation*, section 6.2.) Thus the question is clear. Replying, then, that the "likelihood is the joint probability density" does not clarify but confuses the issue. – whuber Jun 28 '12 at 13:10
  • 2
    @whuber I guess technically the common usage of the term likelihood function means what Lehmann defined it as in his book. But the function itself can be viewed as a function of the parameters for fixed values of the observations or as a function of the observations for given values of the parameters. So when the question is asked "Why is the likelihood function not a pdf?" I was thinking that one can view it as a pdf for the data given the parameters. I think it is a plausible interpretation. But I can now see how you would interpret it your way based on the formal definition. – Michael R. Chernick Jun 28 '12 at 14:06
  • 2
    @MichaelChernick Why could the likelihood be viewed as a function of the observations? The likelihood is a function from $\Theta$ to $\mathbb{R}$; if you view it as a function on the observation space, then that is another function, and it is not the likelihood. – Stéphane Laurent Jun 28 '12 at 17:59
  • 2
    As a function of the data the likelihood is a pdf. Formally it is not called the likelihood function. I just interpreted the question a little differently. – Michael R. Chernick Jun 28 '12 at 18:28
3

The likelihood is defined as $\mathcal{L}(\theta; x_1,\ldots,x_n) = f(x_1,\ldots,x_n; \theta)$. If $f(x; \theta)$ is a probability mass function, the likelihood is never greater than one, but if $f(x; \theta)$ is a probability density function, the likelihood can exceed one, since densities can be greater than one.

When the observations are treated as iid (as is commonly done), the likelihood factors as:
$\mathcal{L}(\theta; x_1,\ldots,x_n) = f(x_1,\ldots,x_n; \theta) = \prod_{j} f(x_j; \theta)$

Another way to see this is through Bayes' rule:

By Bayes' rule, $f(x_1,\ldots,x_n \mid \theta) = \frac{f(\theta \mid x_1,\ldots,x_n)\, f(x_1,\ldots,x_n)}{f(\theta)}$ holds, that is, $\mathcal{L} = \frac{\text{posterior} \times \text{evidence}}{\text{prior}}$. Notice that maximum likelihood estimation treats the ratio of evidence to prior as a constant (see the answers to this question), which omits the prior beliefs. The likelihood is positively related to the posterior, which is a genuine density over the parameters, but the likelihood itself is only one factor of that posterior; the remaining normalizing pieces are often intractable, and without them the likelihood need not be a pdf.

For example, suppose I don't know the mean and standard deviation of a Gaussian distribution and want to estimate them from a large number of observations drawn from that distribution. I first initialize the mean and standard deviation randomly (which defines a Gaussian distribution), then I take one observation, plug it into the estimated distribution, and get a density value from it. I continue feeding in observations, obtain many such values, and multiply them together to get a score. That score is the likelihood, and it can hardly be interpreted as a probability from some pdf.
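
A rough sketch of this procedure (the data and parameter guesses below are invented purely for illustration): evaluate the Gaussian density at each observation for a candidate mean and standard deviation, and combine the values, here as a sum of log densities for numerical stability. The result is a score for the parameters, not a probability of anything:

```python
# Sketch: score candidate Gaussian parameters by the product of density values
# at the observations, computed as a sum of logs. The score is a likelihood.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=200)   # "a lot of observations"

def log_likelihood(mean, sd, observations):
    """Sum of log density values of the observations under Normal(mean, sd)."""
    return np.sum(norm.logpdf(observations, loc=mean, scale=sd))

print(log_likelihood(0.0, 1.0, data))   # poor initial guess: low score
print(log_likelihood(2.0, 1.5, data))   # parameters close to the truth: higher score
```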

Lerner Zhang
0

I'm not a statistician, but my understanding is that while the likelihood function itself is not a PDF with respect to the parameter(s), it is directly related to that PDF by Bayes' rule. The likelihood function, $P(X \mid \theta)$, and the posterior distribution, $f(\theta \mid X)$, are tightly linked; not "a completely different thing" at all.
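
As a small illustration of that link (my own sketch, not part of the answer; the Beta prior and the grid are arbitrary choices), the posterior is proportional to likelihood times prior, so the two are tied together even though only the posterior is a density in $\theta$:

```python
# Sketch: posterior(theta) is proportional to likelihood(theta) * prior(theta).
# The Beta(2, 2) prior and the grid below are assumptions made only for illustration.
import numpy as np
from scipy.stats import beta

theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]
x = 1                                           # single Bernoulli observation
likelihood = theta**x * (1.0 - theta)**(1 - x)  # L(theta), not a density in theta
prior = beta.pdf(theta, a=2.0, b=2.0)           # an assumed prior on theta

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)  # a genuine density in theta

print(likelihood.sum() * dtheta)  # ~0.5: the likelihood alone does not integrate to 1
print(posterior.sum() * dtheta)   # ~1.0: the posterior does
```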

santayana
  • 3
    Welcome to our site! You might find interesting material in the comments to other answers in this thread. Some of them point out why Bayes' Rule does not apply unless additional mathematical machinery is explicitly introduced (such as a Sigma field for the parameter). – whuber Feb 02 '15 at 20:31
  • 1
    Thanks @whuber. I didn't notice any references to Bayes' Rule elsewhere in the thread, but I suppose there are allusions in the comments, assuming one is sufficiently fluent in graduate-level probability to pick up on them (which I'm not). Would you not agree that placing the likelihood function in the context of Bayes' Rule provides useful intuition for the OP's question? – santayana Feb 02 '15 at 21:09
  • 1
    Applying Bayes' rule is not possible without assuming a probability distribution for $\theta$: the distinction between that distribution, and the distribution of the data as a function of $\theta$, is what almost everything in this thread is about. Implicitly assuming there is, or can be, such a distribution is the source of the confusion discussed in the comment thread to Michael Chernick's answer. I would therefore agree that a clear and careful discussion of this point could be helpful, but anything short of that risks creating greater confusion. – whuber Feb 02 '15 at 22:02
  • My apologies, at first glance that thread seemed to amount to little more than a misunderstanding, but now I see the relevant comments you refer to, in particular your quote of Fisher. But does this not come down to a Bayesian v. Frequentist debate? Isn't there a large number of practitioners of Bayesian inference who would argue in favour of a probability distribution for theta? (whether you agree with them is another matter...) – santayana Feb 02 '15 at 22:44
  • 3
    Yes, the B vs. F debate is lurking here. A thoughtful frequentist will happily use Bayes' Rule when there exists a basis to adopt a prior distribution for $\theta$, but parts company from Bayesians by denying that we *must* adopt a prior. We can take our cue from how this question was phrased. If it had instead asked "why can one treat the likelihood function as a PDF (for the parameters)," that would have steered this conversation along Bayesian lines. But by asking it in the negative, the O.P. was looking for us to examine the likelihood from a frequentist point of view. – whuber Feb 02 '15 at 23:16
0

Let's make something clear: likelihood is completely different from probability. When we want to calculate the probability of, for example, getting $x = 0$ when $x$ comes from a normal distribution with $\mu = 0$ and $\sigma = 1$, we need to define a bin, say of width 0.01, and integrate the probability density function (here, the normal density) over it; for instance, we integrate the normal density from $-0.01$ to $0.01$.

But for the likelihood, we just evaluate the pdf at a point. This is totally different from the probability of $x$ being 0: we simply plug $x = 0$ into the function and read off the value. For instance, take the function $y = 1$ for $x \in (0, 1)$; its integral is 1, so it can serve as a pdf, but the value at each $x$ in $(0, 1)$ equals 1, which is not the probability of those points, just the value of the pdf there.
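
A tiny sketch of this distinction (added for illustration): the value of the standard normal pdf at $x = 0$ is a density, roughly 0.4, while the probability of landing in a small bin around 0 comes from integrating the pdf, e.g. via the cdf:

```python
# Sketch: density value at a point vs. probability of a small bin around that point.
from scipy.stats import norm

density_at_zero = norm.pdf(0.0)                    # ~0.3989: a density value, not a probability
prob_small_bin = norm.cdf(0.01) - norm.cdf(-0.01)  # P(-0.01 < X < 0.01) ~ 0.008

print(density_at_zero, prob_small_bin)
# The probability that X equals 0 exactly is 0; only intervals (bins) carry probability
# for a continuous random variable.
```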

Now we get to why likelihood is used in modeling: when we maximize the likelihood function for a set of observed data with respect to an assumed model (a function with parameters to be found), in effect we maximize the probability of those data under that model. We work with the likelihood, but in the end the probability (the integral of the pdf) becomes maximal as well.

Edit 1: regarding the second comment, you're right, my bad; I meant the integral of the pdf over the data that we collected. The whole purpose of modeling is to fit a chosen model to a set of data in such a way that the probability of those data under the chosen model is maximized. For this probability to be maximized, we need to calculate the integral of the pdf over our data. By maximizing the likelihood, we maximize that integral as well.

  • 3
    What does this add to already existing answers? – kjetil b halvorsen Dec 26 '21 at 11:58
  • Because the "integral of [the] pdf" is, by definition, always $1,$ it is hard to see what your last point might be. – whuber Dec 26 '21 at 14:02
  • Re the edit: because data are always *finite,* what do you mean by "integral of pdf over the data"?? – whuber Dec 27 '21 at 15:08
  • Exactly what I mentioned in the text: for instance, the probability of getting $x = 0$ from a normal distribution with $\mu = 0$ and $\sigma = 1$ is 0. For every single point the probability is 0. In general, the probability of finding someone whose height is exactly 180 cm is 0 (by 180 cm I mean 180.000000000000 ... cm), but if we define a bin, then we can calculate the probability of that "event" from our pdf. In this instance, we can calculate the integral of the pdf from 179.5 to 180.5 cm in order to calculate the probability of randomly selecting someone of height 180 cm from a population. – omid omatali Dec 28 '21 at 15:44