
According to the Wikipedia article Likelihood function, the likelihood function is defined as:

$$ \mathcal{L}(\theta|x)=P(x|\theta), $$

with parameters $\theta$ and observed data $x$. This equals $p(x|\theta)$ or $p_\theta(x)$, depending on the notation and on whether $\theta$ is treated as a random variable or as a fixed value.

The notation $\mathcal{L}(\theta|x)$ seems like an unnecessary abstraction to me. Is there any benefit to using $\mathcal{L}(\theta|x)$, or could one equivalently use $P(x|\theta)$? Why was $\mathcal{L}(\theta|x)$ introduced?

Taylor
danijar
  • In this context, it reminds us that the likelihood function is a function of $\theta$ with the data $x$ fixed. On the other hand, the joint distribution is a function of the data $x$ given $\theta$. – knrumsey Jun 11 '17 at 22:25
  • @BigAgnes Thanks. Aren't observed variables fixed by definition, though? I'm also confused why we can call $p(x|\theta)$ a joint distribution. Isn't it a scalar since both $x$ and $\theta$ are fixed (assuming a frequentist approach where $\theta$ is not a random variable)? – danijar Jun 11 '17 at 22:34
  • Closely related: https://stats.stackexchange.com/questions/224037/wikipedia-entry-on-likelihood-seems-ambiguous – Tim Jun 12 '17 at 07:50

4 Answers


Likelihood is a function of $\theta$, given $x$, while $P$ is a function of $x$, given $\theta$.

  • The likelihood function is not a density (or pmf) -- it doesn't integrate (or sum) to 1.

  • Indeed, $\mathcal L$ may be continuous while $P$ is discrete (e.g. the likelihood for a binomial parameter), or vice versa (e.g. the likelihood for an Erlang distribution with unit rate parameter but unspecified shape).

Imagine a bivariate function of a single potential observation $x$ (say a Poisson count) and a single parameter (e.g. $\lambda$) -- in this example discrete in $x$ and continuous in $\lambda$ -- then when you slice that bivariate function of $(x,\lambda)$ one way you get $p_\lambda(x)$ (each slice gives a different pmf) and when you slice it the other way you get $\mathcal L_x(\lambda)$ (each a different likelihood function).

(That bivariate function simply expresses the way $x$ and $\lambda$ are related via your model)

[Alternatively, consider a discrete $\theta$ and a continuous $x$; here the likelihood is discrete and the density continuous.]

As soon as you specify $x$, you identify a particular $\mathcal L$, which we call the likelihood function of that sample. It tells you about $\theta$ for that sample -- in particular, which values had more or less likelihood of giving that sample.

Likelihood is a function that tells you about the relative chance that a given value of $\theta$ could produce your data (in that ratios of likelihoods can be thought of as ratios of probabilities of the data falling in $(x, x+dx)$).
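A rough numerical sketch of the slicing idea, using the Poisson example above (the observation, grid, and rate values here are made up for illustration):

```python
import math

# The Poisson model p(x | lam) = lam^x e^(-lam) / x!, viewed as a
# bivariate function of the count x and the rate parameter lam.
def poisson_pmf(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

# Slice one way: fix lam = 2.0 and vary x -> a pmf over counts (sums to 1).
lam = 2.0
pmf_slice = [poisson_pmf(x, lam) for x in range(50)]
print(sum(pmf_slice))  # ~1.0

# Slice the other way: fix the observation x = 3 and vary lam ->
# the likelihood function L(lam | x = 3). It does not sum to 1.
x_obs = 3
lams = [0.5 * k for k in range(1, 21)]  # grid 0.5, 1.0, ..., 10.0
lik_slice = [poisson_pmf(x_obs, lam) for lam in lams]

# The likelihood is maximized at lam = x_obs = 3 (the MLE for one Poisson draw).
mle = max(lams, key=lambda lam: poisson_pmf(x_obs, lam))
print(mle)  # 3.0
```

Each list is one slice of the same bivariate function: the first is a pmf in $x$, the second a likelihood in $\lambda$.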

Glen_b
  • It's *not* a density. For any given $\theta$ its value is equal to that of a density evaluated at a specific $x$, but it's equal to a *different* density at every $\theta$. Imagine you took every possible value of $\theta$ (imagine for the moment a discrete $\theta$ but continuous $p$) and for each one you drew the probability density $p$. Then at the specific sample value ($x$), you slice orthogonally across all those different densities. That slice is a likelihood function --- and *it is not itself a density*. – Glen_b Nov 12 '19 at 22:45
  • The second thing. For each specific value of $\theta$ and a given $x$, $L$ is equal to the value of the density evaluated at that $x$, given that $\theta$. But the density changes with $\theta$, so $L$ is equal to the value of a *different* density (each evaluated at $x$) at every point on $L$. – Glen_b Nov 12 '19 at 23:27
  • @Glen_b-ReinstateMonica I know it's a bit of an ask, but can you draw pictorially what you mentioned about taking a slice, so we can create the same mental picture you have? – GENIVI-LEARNER Jan 23 '20 at 14:37

By Bayes' theorem, $f(\theta|x_1,...,x_n) = \frac{f(x_1,...,x_n|\theta) \, f(\theta)}{f(x_1,...,x_n)}$ holds, that is, $\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$.

Notice that the maximum likelihood estimate omits the prior beliefs (or defaults the prior to a zero-mean Gaussian, which then acts as L2 regularization, i.e. weight decay) and treats the evidence as a constant (when calculating the partial derivative with respect to $\theta$).

It tries to maximize the likelihood by adjusting $\theta$, using $f(x_1,...,x_n|\theta)$ (which we can easily compute; it is usually the loss) as the objective in place of $f(\theta|x_1,...,x_n)$, and we write this likelihood as $\mathcal{L}(\theta|\mathbf x)$. The true posterior $\frac{f(x_1,...,x_n|\theta) \, f(\theta)}{f(x_1,...,x_n)}$ can hardly be worked out because the evidence (the denominator), $\int_{\theta} f(x_1,...,x_n,\theta)\,d\theta$, is intractable.
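A minimal sketch of the point about omitting the prior, for a Gaussian mean with known unit variance (the data values and prior variance below are made up for illustration):

```python
# Estimating the mean mu of N(mu, 1). The MLE maximizes the likelihood
# alone; MAP adds a zero-mean Gaussian prior on mu, which shows up as an
# L2 penalty (weight decay) and shrinks the estimate toward 0.
data = [2.0, 1.5, 2.5, 2.0, 2.0]
n = len(data)
prior_var = 1.0  # variance of the assumed zero-mean Gaussian prior

# MLE: argmax_mu prod_i N(x_i | mu, 1), closed form = sample mean.
mu_mle = sum(data) / n

# MAP: argmax_mu [log-likelihood - mu^2 / (2 * prior_var)],
# closed form = sum(x) / (n + 1/prior_var), shrunk toward the prior mean 0.
mu_map = sum(data) / (n + 1.0 / prior_var)

print(mu_mle)  # 2.0
print(mu_map)  # ~1.667, pulled toward 0 by the prior
```

Neither estimate needs the evidence term: it is constant in $\theta$, so it drops out of the maximization — which is exactly why MLE (and MAP) remain tractable when the full posterior is not.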

Hope this helps.

Lerner Zhang

I agree with @BigAgnes. Here is what my professor taught in class: one way is to think of the likelihood function $L(\theta | \mathbf{x})$ as a random function which depends on the data. Different data give us different likelihood functions, so you may say we are conditioning on the data. Given a realization of the data, we want to find a $\hat{\theta}$ such that $L(\theta | \mathbf{x})$ is maximized; you can say $\hat{\theta}$ is most consistent with the data. This is the same as saying we maximize the "observed probability" $P(\mathbf{x} | \theta)$. We use $P(\mathbf{x} | \theta)$ to do the calculation, but it is different from $P(\mathbf{X} | \theta)$: small $\mathbf{x}$ stands for the observed values, while $\mathbf{X}$ stands for the random variable. If you know $\theta$, then $P(\mathbf{x} | \theta)$ is the probability/density of observing $\mathbf{x}$.
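A minimal sketch of this "random function" view, assuming an $N(\mu, 1)$ model (the two samples and the grid below are made up for illustration):

```python
import math

# The likelihood is "random" only through the data: two different samples
# from N(mu, 1) give two different likelihood functions of mu. Once the
# data are fixed, each likelihood function is deterministic.
def log_lik(mu, sample):
    return sum(-0.5 * (x - mu)**2 - 0.5 * math.log(2 * math.pi)
               for x in sample)

sample_a = [1.2, 0.8, 1.5]
sample_b = [3.1, 2.7, 3.4]

grid = [k / 10 for k in range(0, 51)]  # mu in 0.0, 0.1, ..., 5.0
mle_a = max(grid, key=lambda m: log_lik(m, sample_a))
mle_b = max(grid, key=lambda m: log_lik(m, sample_b))
print(mle_a, mle_b)  # 1.2 3.1 -- different data, different maximizers
```

Each sample yields its own curve over $\mu$, and maximizing that curve gives the $\hat{\theta}$ most consistent with that sample.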

jwyao
  • Thanks. Could I equivalently use $P(x|\theta)$ (with lowercase $x$) instead of $\mathcal{L}(\theta|x)$? When we write $\max_\theta P(x|\theta)$ it should be clear that $x$ is fixed and we're trying to find the most consistent $\theta$. Or does $\mathcal{L}(\theta|x)$ refer to something more abstract that has a different implementation in some situations? – danijar Jun 11 '17 at 22:52
  • Also, could you elaborate why $\mathcal{L}(\theta|x)$ is a random function? It seems like it should be deterministic since both $x$ and $\theta$ are fixed (unless we give $\theta$ a Bayesian treatment and consider it a random variable). – danijar Jun 11 '17 at 22:54
  • It is better to use $L(\theta | \mathbf{x})$ (actually the likelihood is defined in such a way), because it is a function of $\theta$ rather than $\mathbf{x}$. I don't know if $L(\theta | \mathbf{x})$ refers to something abstract. As for the random function argument, it is just a way of thinking of the likelihood function. The true $\theta$ is fixed, but we don't know it; that's why we estimate it. You plug your observations into $L(\theta | \mathbf{x})$, and different data give you different functions. So the likelihood function depends on your observations, and in that sense it is like a function of random variables. – jwyao Jun 11 '17 at 23:09
  • $L(\theta | \mathbf{x})$ looks like a posterior distribution but in fact, it isn't. There is no assumption on the (prior) distribution $\pi (\theta)$ of $\theta$. – jwyao Jun 11 '17 at 23:11
  • So one could write $\mathcal{L}_x(\theta)$ to express this more clearly? (I know we shouldn't write this in practice since it's not common notation.) – danijar Jun 11 '17 at 23:17
  • My guess is yes. Your notation states it's a function of $\theta$ clearly. But as you said, in practice $L(\theta | \mathbf{x})$ is what people use. I think it is a common notation. You may want to look at some standard statistics textbooks, like [link](https://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126/ref=sr_1_2?ie=UTF8&qid=1497223253&sr=8-2&keywords=statistical+inference+casella). – jwyao Jun 11 '17 at 23:22

I think the other answers given by jwyao and Glen_b are quite good. I just wanted to add a very simple example which is too long for a comment.

Consider one observation $X$ from a Bernoulli distribution with probability of success $\theta$. With $\theta$ fixed (known or unknown), the distribution of $X$ is given by the pmf

$$P(x|\theta) = \theta^x(1-\theta)^{1-x}$$

In other words, we know that $P(X=1) = 1 - P(X=0) = \theta$.

Alternatively, we could treat the observation as fixed and view this as a function of $\theta$.

$$L(\theta | x) = \theta^x(1-\theta)^{1-x}$$

In a maximum likelihood setting, we seek the $\theta$ which maximizes the likelihood as a function of $\theta$. For example, if we observe $X = 1$, then the likelihood becomes

$$L(\theta | x) = \begin{cases} \theta, & \theta \in [0,1] \\ 0, & \text{otherwise} \end{cases}$$

and we see that the MLE would be $\hat\theta = 1$.
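A minimal sketch of the two views of this same function (the value of $\theta$ and the grid below are arbitrary choices for illustration):

```python
# The same expression theta^x * (1 - theta)^(1 - x), sliced two ways.
def bernoulli(x, theta):
    return theta**x * (1 - theta)**(1 - x)

# As a pmf in x with theta fixed: P(X=1) = theta, P(X=0) = 1 - theta.
theta = 0.3
print(bernoulli(1, theta), bernoulli(0, theta))  # 0.3 0.7

# As a likelihood in theta with x = 1 observed: L(theta | x=1) = theta,
# so scanning a grid of theta values puts the maximum at theta = 1.
grid = [k / 100 for k in range(101)]
mle = max(grid, key=lambda t: bernoulli(1, t))
print(mle)  # 1.0
```

The function is unchanged; only which argument we hold fixed differs, which is exactly the distinction the two notations encode.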

Not sure that I've really added any value to the discussion, but I just wanted to give a simple example of the different ways of viewing the same function.

knrumsey