
I know that similar questions have already been answered on this platform, but none of them really address my specific question, which is the following:

Bayes' theorem arises solely by rearranging the multiplicative law of probability: $$p(\theta|x)p(x) = p(x|\theta)p(\theta)$$ $$ p(\theta|x) = \frac{p(x|\theta)p(\theta)}{p(x)}$$

Hence, all the quantities involved are proper pmfs or pdfs. However, I constantly read that the likelihood in Bayes' theorem is not a proper probability (pmf or pdf) since it is not normalized to one. How is that possible?

I understand the concept of the likelihood function $L(\theta|x)=p(x|\theta)$ in MLE and why it is not a pdf (or pmf): it holds the random variable $x$ fixed and varies the parameter $\theta$. However, this cannot be what is used in Bayes' theorem, since Bayes' theorem requires that the quantities involved are pdfs (or pmfs); otherwise it would be mathematically wrong. So which mistake am I making, or what do I not know about the likelihood in Bayes' theorem?

Here https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading11.pdf is a numerical example where the likelihoods indeed do not add up to 1 in Bayes' theorem, but I do not understand how this is possible, since they should be probabilities and hence should add up to 1.
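
To make that table concrete, here is a minimal sketch in Python. The likelihood values $P(D|A)=0.5$, $P(D|B)=0.6$, $P(D|C)=0.9$ are the ones quoted from the reading; the uniform prior is my own illustrative assumption, not necessarily the one used there:

```python
# Three-coin example in the style of the linked MIT reading.
# Likelihood values are quoted from the reading; the uniform prior is an
# illustrative assumption only.
likelihood = {"A": 0.5, "B": 0.6, "C": 0.9}    # P(D | hypothesis)
prior = {h: 1 / 3 for h in likelihood}         # assumed uniform prior

# The likelihood column does not sum to 1: it is not a distribution over
# the hypotheses.
print(sum(likelihood.values()))                # 2.0

# Bayes' theorem: posterior = likelihood * prior / evidence.
evidence = sum(likelihood[h] * prior[h] for h in likelihood)
posterior = {h: likelihood[h] * prior[h] / evidence for h in likelihood}

# The posterior *is* a proper pmf over the hypotheses: it sums to 1.
print(posterior)
print(sum(posterior.values()))                 # 1.0
```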

guest1
  • I marked your question as a duplicate of another one that seems to answer it. If it doesn't, please tell us why. TL;DR: the likelihood in MLE is *not* a conditional probability because $\theta$ is not a random variable; in the Bayesian setting it is. – Tim Feb 10 '20 at 21:08
  • Hi, thank you for your reply. Yes, I understand why the MLE likelihood function $L$ is not a conditional probability. But the issue is that a lot of authors state that you would use that likelihood function $L$ in Bayes' theorem as well. For example, in the link I shared they used a numerical example (table on page 3) where the likelihoods indeed do not add up to 1, and the lecturer emphasizes that fact in point 5 on page 4. In fact, he states on p. 2 "Likelihood: (This is the same likelihood we used for the MLE.)" when he talks about the likelihood in Bayes' theorem. – guest1 Feb 11 '20 at 15:17
  • So it seems that many authors and lecturers use the likelihood function $L$ in Bayes' theorem, which is nonsensical exactly because of the fact that you stated: that in Bayes' theorem it has to be a conditional probability distribution where $\theta$ and $x$ are random variables. – guest1 Feb 11 '20 at 15:18
  • Similarly, Gelman writes in his book on page 7: "The second term in this expression, p(x|θ), is taken here as a function of θ, not of x", when he talks about Bayes' theorem. But that again would be wrong, right? Since the likelihood in Bayes' theorem should be a distribution over the random variable $x$ conditioned on the random variable $\theta$. – guest1 Feb 11 '20 at 15:21
  • I cannot comment on that, because I don't know what you are referring to, but in Bayesian context parameters and data are *always* considered as random variables. Otherwise you cannot apply Bayes theorem. – Tim Feb 11 '20 at 15:21
  • Maybe just post the exact quotes you find confusing? There's nothing incorrect with what Gelman says, likelihood is a conditional probability distribution of the data given a parameter, where the parameter is considered as a random variable, and you plug-in the possible values of the parameter to the likelihood function. – Tim Feb 11 '20 at 15:24
  • Well, with respect to the quote from Gelman's book, what I find confusing is that it should be the other way round, right? It should be a "function" (actually a probability distribution) over $x$ conditioned on $\theta$. Say this were a Gaussian distribution; then it would be a Gaussian over the variable $x$ with mean $\mu$ and standard deviation $\sigma$ as parameters, which, however, are also treated as random variables. Maybe in that case it is just a language thing, but I would then call this a function of $x$ given $\theta$, not the other way round like Gelman says. – guest1 Feb 11 '20 at 18:45
  • The quote from the link that I find confusing is the following: "Likelihood: (This is the same likelihood we used for the MLE.) The likelihood function is P(D|H), i.e., the probability of the data assuming that the hypothesis is true. Most often we will consider the data as fixed and let the hypothesis vary. For example, P(D|A) = probability of heads if the coin is type A. In our case the likelihoods are P(D|A) = 0.5, P(D|B) = 0.6, P(D|C) = 0.9" – guest1 Feb 11 '20 at 18:46
  • And the other quote: " The likelihood column does not sum to 1. The likelihood function is not a probability function." – guest1 Feb 11 '20 at 18:47
  • So what I get from these quotes from my last two comments is that instead of the probability distribution P(D|H) he is plugging in the likelihood function L, because the likelihood function makes exactly the assumption that the data is fixed while the parameters vary (this is the whole principle of MLE). But this should in my opinion (and I think you agree with me there) not be allowed in Bayes' theorem. – guest1 Feb 11 '20 at 18:51
  • why should it not be allowed? You update your prior given the data you observed, that's what Bayesian statistics are about. – Tim Feb 11 '20 at 20:14
  • I mean you yourself have stated before that all the quantities in Bayes theorem have to be probability distributions, with which I agree. But I pointed out that the likelihood is *not* a probability distribution hence it should not be allowed to use it in Bayes theorem. – guest1 Feb 12 '20 at 07:45
  • Likelihood does not have to "sum to 1" to be a probability distribution. Moreover, it does integrate to one for *all possible* values of $x$: $\int p_\theta(x) dx = 1$, as it is a probability distribution. When evaluating likelihood function in Bayes theorem you ask what is the probability of observing *some particular*, observed $x$ value(s) given some particular value of $\theta$. – Tim Feb 12 '20 at 08:21
  • So is the statement " The likelihood column does not sum to 1. The likelihood function is not a probability function." from that lecture I have been linked wrong? – guest1 Feb 12 '20 at 08:28
  • And why does the likelihood not have to sum to 1 to be a probability distribution? Isn't this part of the definition of a probability distribution? – guest1 Feb 14 '20 at 07:40
  • It does integrate to 1, but for *all the possible* values of $X$. $p(x|\theta) \ne 1$, but $\int_{-\infty}^\infty p(x|\theta) \, dx = 1$. When using Bayes theorem you are *evaluating* likelihood only on the few samples you observed, not taking integral over all possible values of $X$. – Tim Feb 14 '20 at 08:04
  • Okay, thank you. But then the statement from the lecture is wrong, right? "The likelihood column does not sum to 1. The likelihood function is not a probability function." Because, after your explanation, I would assume the only reason that it does not sum to 1 is that it is not summed over all $x$. So he cannot conclude that it isn't a probability distribution, right? – guest1 Feb 14 '20 at 13:00
  • I'll re-open the question since from your comments it is now more clear what exactly is the problem. – Tim Feb 14 '20 at 13:23
  • See https://stats.stackexchange.com/questions/97515/what-does-likelihood-is-only-defined-up-to-a-multiplicative-constant-of-proport/97522#97522 – kjetil b halvorsen Feb 14 '20 at 13:33

2 Answers


If you integrate the conditional probability, you get

$$ \int_\Theta p(\theta|x)\,d\theta = 1,$$ as expected: the posterior is a proper probability distribution, where I define "proper" to mean that the integral over the parameter space is 1 and not just finite. But in many cases a probability distribution is, in practice, a product of bounded, positive functions, each of which is individually not a proper probability distribution. In Bayes' theorem, the posterior is

$$\frac{p(x|\theta)p(\theta)}{p(x)},$$

but this puts no requirements on $p(x|\theta)$ or $p(\theta)$ individually: $p(x|\theta)$ is a probability distribution in $x$, but it is just a function of $\theta$. Thus, the integral $$\int_\Theta p(x|\theta)\, d\theta \neq 1$$ in many interesting cases.
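
As a quick numerical check of this (a minimal sketch, using a Bernoulli model of my own choosing):

```python
import numpy as np

# Bernoulli model: p(x|theta) = theta^x * (1 - theta)^(1 - x), x in {0, 1}.
def p(x, theta):
    return theta**x * (1 - theta) ** (1 - x)

theta = 0.3
# As a function of x for fixed theta, it is a proper pmf: it sums to 1.
print(p(0, theta) + p(1, theta))     # 1.0

# As a function of theta for fixed x = 1, it is just theta, and a Riemann
# sum over [0, 1] approximates int_0^1 theta d(theta) = 0.5, not 1.
thetas = np.linspace(0.0, 1.0, 100_001)
print(p(1, thetas).mean())           # ~0.5
```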

Forgottenscience

Not sure whether I understand precisely what you don't understand. My impression is that it just confuses you that one can speak about $p(x|\theta)$ as both a "proper pmf/pdf" (if interpreted as a function of $x$) and a likelihood (if interpreted as a function of $\theta$).

The formula gives you the value of $p(\theta|x)$ for fixed values of $x$ and $\theta$, and for this it doesn't matter whether $p(x|\theta)$ is interpreted as a function of $x$ or of $\theta$. So one can say that there are only proper pmfs/pdfs in the formula, but (interpreting differently what $p(x|\theta)$ is a function of) also that the likelihood appears in it, which is not a pdf/pmf. (One can also say that $p(\theta|x)$ and $p(x|\theta)$ are both functions of both $\theta$ and $x$, and again there is some freedom to focus on $x$ or $\theta$ when interpreting them.)

Actually, for $p(\theta|x)$ to become a proper pdf/pmf over $\theta$ given $x$, $p(x|\theta)$ must be a pmf/pdf over $x$ for given $\theta$, which is exactly what it is. It does not have to be a pmf/pdf over $\theta$, which it isn't.
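
To see the two readings with the same numbers, here is a minimal sketch using a binomial model via `scipy.stats.binom` (my choice of example, not something from the question):

```python
import numpy as np
from scipy.stats import binom

n = 10  # number of coin flips

# Reading 1: p(x|theta) as a pmf over x for fixed theta. Summing over all
# possible outcomes x gives 1, as a proper pmf must.
theta = 0.6
print(sum(binom.pmf(x, n, theta) for x in range(n + 1)))  # 1.0

# Reading 2: the same formula read as a likelihood over theta for a fixed
# observation x = 7. These are the very same function values, but nothing
# forces them to integrate to 1 over theta; here the integral is 1/(n+1).
x_obs = 7
thetas = np.linspace(0.0, 1.0, 100_001)
print(binom.pmf(x_obs, n, thetas).mean())  # ~0.0909 = 1/(n+1), not 1
```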

Christian Hennig
  • Okay, I think your comment has made it a bit clearer to me. So to reiterate in my own words: the likelihood $L$ and $p(x|\theta)$ are given by the same formula. The difference is just what is interpreted as a random variable and what as a parameter, or, more precisely, with respect to which quantity it is taken to be a function/probability density. However, what is still not clear to me: if I plug in that formula but interpret it as a likelihood function, then it doesn't have the properties of a probability density anymore, and hence Bayes' theorem could not be applied anymore. – guest1 Mar 03 '20 at 11:30
  • Because if $p(x|\theta)$ is not interpreted as the conditional probability of $x$ given $\theta$ but as a function of $\theta$ (treated as a parameter, not as a random variable) given fixed $x$, then the transformation properties are completely different. Hence, if I multiply it with, e.g., the prior probability density, then the multiplicative law of probability cannot be used. In particular, $p(x|\theta) p(\theta) = p(x, \theta)$ but $L(\theta) p(\theta) \neq p(x, \theta)$. – guest1 Mar 03 '20 at 11:32
  • The mathematical properties of $p(x|\theta)$ and its theory don't depend on whether you call it a likelihood or a conditional pmf/pdf. It's still $p(x|\theta)$, so if you call it $L(\theta)$ for given $x$, then still $L(\theta)p(\theta)=p(x,\theta)$ (although this would be somewhat confusing notation because it drops the dependence on $x$). It's mathematics, and mathematics is governed by well-defined mathematical objects, not by interpretations. – Christian Hennig Mar 03 '20 at 13:52
  • Hmm, but if $x$ is not treated as a random variable, the multiplicative law of probability just doesn't apply, does it? And the likelihood $L(\theta)$ treats neither $x$ nor $\theta$ as a random variable. Hence, the multiplicative law of probability does not apply. So it certainly depends on how a quantity is interpreted. Just as an example: if I interpret some quantity $y$ as a random variable, then it has properties like a mean, a variance, etc. If I treat $y$ just like a parameter, then it is just a real number. So it does depend on how I interpret (or, more precisely, define) a quantity. – guest1 Mar 03 '20 at 15:18
  • I can't follow you. If $L(\theta)=p(x|\theta)$, then $p(x,\theta)=p(x|\theta)p(\theta)=L(\theta)p(\theta)$. This of course relies on considering a specific $x$ that is dropped in the notation of $L(\theta)$. Note that $L(\theta)$ is a value, whereas $L(\bullet)$ is a (likelihood) function. $L(\theta)$ is its value at specific $\theta$, the very same value as $p(x|\theta)$, so just because you write it $L(\theta)$, it doesn't actually "know" or change its meaning according to the fact that it is written as a function of $\theta$. – Christian Hennig Mar 03 '20 at 15:31
  • The problem may be that you think of $L(\theta)$ as a function in $\theta$ (which is how many people talk about it), but with correct use of mathematical notation it's just a number, the likelihood evaluated at a specific value $\theta$; whereas $L$ or $L(\bullet)$ is the actual likelihood function. I have to admit that even I use sloppy wording that could suggest $L(\theta)$ is a function, which it actually isn't. – Christian Hennig Mar 03 '20 at 15:35
  • Okay so I think your last comment explained actually exactly what I did not understand. Thank you for this clarification! So if $L(\theta)$ is only a number and not the function, is $p(x|\theta)$ then also just a number and not the whole probability density? If so, how would I have to write $p(x|\theta)$ to denote the actual probability density and not just a specific value? – guest1 Mar 04 '20 at 07:12
  • The thing is that sloppy notation like this is everywhere, so you will see people using $p(x|\theta)$ for the density (which very often doesn't lead to problems but it seems your issue was caused by this), however correctly the density is $p(\bullet|\theta)$ or, written down in full detail, $p_\theta$ defined as $p_\theta(x)=p(x|\theta) \forall x$. – Christian Hennig Mar 04 '20 at 16:40
  • But does this imply that Bayes' theorem actually only applies when plugging in a specific outcome $x$ of the random variable $X$ (and thus obtaining a numerical value for the probability density)? I always thought that Bayes' theorem applies to the analytical expressions of the probability densities involved, i.e., for all possible outcomes $x$ of the random variable $X$ in $p(X=x|\theta)$. But does it now only apply if the outcome $x$ has been fixed beforehand, and hence $p(X=x|\theta)$ is a specific numerical value? – guest1 Mar 10 '20 at 11:29
  • It actually holds for *all* values, meaning that if you choose any of them as the "specific value", it will hold. The mathematical expression doesn't care whether something is "fixed beforehand" or not. It holds for whatever $x$ you take, but then it's a statement for that $x$. (Actually, because of this I wouldn't be surprised if one could find shorthand ways of writing it as a statement about a random variable rather than a value.) – Christian Hennig Mar 11 '20 at 12:05
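
To make the point of the last few comments concrete, here is a minimal sketch with a toy discrete model (the numbers are my own illustrative choices): whatever we call $p(x|\theta)$, its product with the prior gives the same joint value, and the resulting joint is a proper pmf.

```python
# Toy model: theta in {0.3, 0.7} with a uniform prior (illustrative
# choice), and x in {0, 1} Bernoulli given theta.
prior = {0.3: 0.5, 0.7: 0.5}               # p(theta)

def p_x_given_theta(x, theta):             # p(x|theta)
    return theta if x == 1 else 1 - theta

# For every pair (x, theta), the multiplicative law holds regardless of
# whether we read p(x|theta) as a conditional pmf or as a likelihood value:
for x in (0, 1):
    for theta in prior:
        L = p_x_given_theta(x, theta)      # "L(theta)": just a number
        joint = L * prior[theta]           # p(x, theta) = p(x|theta) p(theta)
        print(x, theta, L, joint)

# Summing the joint over both variables gives 1, so the joint built from
# L(theta) * p(theta) is a proper pmf.
print(sum(p_x_given_theta(x, t) * prior[t] for x in (0, 1) for t in prior))  # 1.0
```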