25

I'm reading a paper where the authors are leading from a discussion of maximum likelihood estimation to Bayes' Theorem, ostensibly as an introduction for beginners.

As a likelihood example, they start with a binomial distribution:

$$p(x|n,\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}$$

and then log both sides

$$\ell(\theta|x, n) = x \ln (\theta) + (n-x)\ln (1-\theta)$$

with the rationale that:

"Because the likelihood is only defined up to a multiplicative constant of proportionality (or an additive constant for the log-likelihood), we can rescale ... by dropping the binomial coefficient and writing the log-likelihood in place of the likelihood"

The math makes sense, but I can't understand what is meant by "the likelihood is only defined up to a multiplicative constant of proportionality" and how this allows dropping the binomial coefficient and going from $p(x|n,\theta)$ to $\ell(\theta|x,n)$.

Similar terminology has come up in other questions (here and here), but it is still not clear what, practically, it means for a likelihood to be defined, or to carry information, only up to a multiplicative constant. Is it possible to explain this in layman's terms?

kjetil b halvorsen
kmm

5 Answers

25

The point is that sometimes, different models (for the same data) can lead to likelihood functions which differ by a multiplicative constant, but the information content must clearly be the same. An example:

We model $n$ independent Bernoulli experiments, leading to data $X_1, \dots, X_n$, each with a Bernoulli distribution with (probability) parameter $p$. This leads to the likelihood function $$ \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} $$ Or we can summarize the data by the variable $Y=X_1+X_2+\dotsm+X_n$, which has a binomial distribution, leading to the likelihood function $$ \binom{n}{y} p^y (1-p)^{n-y} $$ which, as a function of the unknown parameter $p$, is proportional to the former likelihood function. The two likelihood functions clearly contain the same information, and should lead to the same inferences!
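To make this concrete, here is a small numerical sketch (my own illustration, not part of the original answer; the 0/1 data are made up): on a grid of $p$ values, the Bernoulli-product and binomial likelihoods for the same data differ only by the constant $\binom{n}{y}$, so their ratio is flat in $p$ and both peak at the same $p$.

```python
# Sketch with hypothetical data: the two likelihoods differ only by the
# constant C(n, y) and are maximized at the same p.
import numpy as np
from scipy.stats import bernoulli, binom
from scipy.special import comb

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])            # hypothetical 0/1 data
n, y = len(x), x.sum()
p_grid = np.linspace(0.01, 0.99, 99)

lik_bernoulli = np.array([bernoulli.pmf(x, p).prod() for p in p_grid])
lik_binomial = binom.pmf(y, n, p_grid)

print(np.allclose(lik_binomial / lik_bernoulli, comb(n, y)))          # ratio is the constant C(n, y)
print(p_grid[lik_bernoulli.argmax()], p_grid[lik_binomial.argmax()])  # same maximizer
```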

And indeed, by definition, they are considered the same likelihood function.


Another viewpoint: observe that when the likelihood function is used in Bayes' theorem, as needed for Bayesian analysis, such multiplicative constants simply cancel, so they are clearly irrelevant to Bayesian inference. Likewise, they cancel when calculating likelihood ratios, as used in optimal hypothesis tests (the Neyman-Pearson lemma), and they have no influence on the value of maximum likelihood estimators. So we can see that in much of frequentist inference they cannot play a role.
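As a hedged illustration of the cancellation (my own sketch, using a discrete grid and a flat prior; none of this is from the original answer), the grid posterior for $p$ comes out identical whether or not the likelihood carries the binomial coefficient:

```python
# Sketch: the grid posterior is unchanged when the binomial coefficient is
# dropped, because normalization removes any constant factor.
import numpy as np
from scipy.stats import binom

n, y = 8, 5                                         # hypothetical data summary
p_grid = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(p_grid) / p_grid.size          # flat prior over the grid

lik_full = binom.pmf(y, n, p_grid)                  # with the coefficient C(n, y)
lik_kernel = p_grid**y * (1 - p_grid)**(n - y)      # coefficient dropped

post_full = lik_full * prior / (lik_full * prior).sum()
post_kernel = lik_kernel * prior / (lik_kernel * prior).sum()
print(np.allclose(post_full, post_kernel))          # True: identical posteriors
```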


We can argue from still another viewpoint. The Bernoulli probability function (hereafter we use the term "density") above is really a density with respect to counting measure, that is, the measure on the non-negative integers with mass one at each non-negative integer. But we could have defined a density with respect to some other dominating measure. In this example this will seem (and is) artificial, but in larger spaces (function spaces) it is really fundamental! Let us, for the purpose of illustration, use a specific geometric distribution, written $\lambda$, with $\lambda(0)=1/2$, $\lambda(1)=1/4$, $\lambda(2)=1/8$ and so on. Then the density of the Bernoulli distribution with respect to $\lambda$ is given by $$ f_{\lambda}(x) = p^x (1-p)^{1-x}\cdot 2^{x+1} $$ meaning that $$ P(X=x)= f_\lambda(x) \cdot \lambda(x) $$ With this new dominating measure, the likelihood function becomes (with notation from above) $$ \prod_{i=1}^n p^{x_i} (1-p)^{1-x_i} 2^{x_i+1} = p^y (1-p)^{n-y} 2^{y+n} $$ Note the extra factor $2^{y+n}$. So when changing the dominating measure used in the definition of the likelihood function, a new multiplicative constant arises, which does not depend on the unknown parameter $p$ and is clearly irrelevant. That is another way to see how multiplicative constants must be irrelevant. This argument can be generalized using Radon-Nikodym derivatives (the argument above is an example).
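A quick numerical check of the algebra above (again my own sketch, with made-up data): the per-observation factors $p^{x_i}(1-p)^{1-x_i}2^{x_i+1}$ multiply out to $p^y(1-p)^{n-y}2^{y+n}$, and the extra factor $2^{y+n}$ does not move the maximizing $p$.

```python
# Sketch with made-up data: verify the product identity and that the factor
# 2**(y+n) does not change where the likelihood is maximized.
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])             # hypothetical 0/1 data
n, y = len(x), x.sum()
p_grid = np.linspace(0.01, 0.99, 99)

lik_lambda = np.array([np.prod(p**x * (1 - p)**(1 - x) * 2.0**(x + 1))
                       for p in p_grid])                        # density w.r.t. lambda
lik_counting = p_grid**y * (1 - p_grid)**(n - y)                # counting-measure kernel

print(np.allclose(lik_lambda, lik_counting * 2.0**(y + n)))           # identity holds
print(p_grid[lik_lambda.argmax()] == p_grid[lik_counting.argmax()])   # same maximizer
```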

kjetil b halvorsen
  • "the information content must clearly be the same" This is only true if you believe in the likelihood principle! – jsk May 16 '14 at 05:30
  • Yes, maybe, but I did show how it follows from Bayesian principles. – kjetil b halvorsen May 16 '14 at 12:38
  • @kjetilbhalvorsen Thank you for the thoughtful answer! One thing I'm still confused about is why the likelihood of the Bernoulli distribution doesn't include a binomial coefficient. Your answer makes it clear why it doesn't matter, but I don't understand why it's left off of the likelihood in the first place. – jvans Apr 30 '18 at 14:43
  • @jvans: It's because the binomial coefficient does not depend on the unknown parameter, so it cannot influence the shape of the likelihood function. – kjetil b halvorsen May 03 '18 at 12:54
  • @jvans, to answer this: "One thing I'm still confused about is why the likelihood of the Bernoulli distribution doesn't include a binomial coefficient." Intuitively, the Bernoulli likelihood assigns probability to a specific order of the sequence of trials (e.g., 0011). However, the binomial likelihood assigns probability to the total number of successes (the number of 1's in the sequence), that is $\sum_i x_i$. The order here does not matter (sums are order-invariant), so you have to count all the possible combinations that would have generated the 2 successes in 0011. The binomial coefficient does that count. – Pietro Jun 09 '21 at 15:35
12

It basically means that only the relative value of the PDF matters. For instance, the standard normal (Gaussian) PDF is $f(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}$; your book is saying that they could use $g(x)=e^{-x^2/2}$ instead, because they don't care about the scale, i.e. $c=\frac{1}{\sqrt{2\pi}}$.

This happens because they maximize the likelihood function, and $c\cdot g(x)$ and $g(x)$ attain their maximum at the same point. Hence, the maximum of $e^{-x^2/2}$ will occur at the same place as that of $f(x)$. So they don't bother with the scale.
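A tiny grid check of this point (my own sketch, not from the answer): $f$ and $g$ differ only by the scale $1/\sqrt{2\pi}$, so they peak at the same $x$.

```python
# Sketch: the scaled and unscaled Gaussian curves peak at the same place.
import numpy as np

x = np.linspace(-3, 3, 601)
f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal pdf
g = np.exp(-x**2 / 2)                        # same curve without the scale
print(x[f.argmax()], x[g.argmax()])          # both 0.0
```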

Aksakal
7

I cannot explain the meaning of the quotation, but for maximum-likelihood estimation, it does not matter whether we choose to find the maximum of the likelihood function $L(\mathbf x; \theta)$ (regarded as a function of $\theta$) or the maximum of $aL(\mathbf x; \theta)$ where $a$ is some positive constant. This is because we are not interested in the maximum value of $L(\mathbf x; \theta)$ but rather the value $\theta_{\text{ML}}$ where this maximum occurs, and both $L(\mathbf x; \theta)$ and $aL(\mathbf x; \theta)$ achieve their maximum value at the same $\theta_{\text{ML}}$. So, multiplicative constants can be ignored. Similarly, we could choose to consider any increasing function $g(\cdot)$ (such as the logarithm) of the likelihood function $L(\mathbf x; \theta)$, determine the maximum of $g(L(\mathbf x;\theta))$, and infer the value of $\theta_{\text{ML}}$ from this. For the logarithm, the multiplicative constant $a$ becomes the additive constant $\ln(a)$, and this too can be ignored in the process of finding the location of the maximum: $\ln(a)+\ln(L(\mathbf x; \theta))$ is maximized at the same point as $\ln(L(\mathbf x; \theta))$.
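Here is a short grid illustration of that argument (my own sketch, with hypothetical data from an exponential model and an arbitrary constant $a$): $L(\theta)$, $aL(\theta)$ and $\ln L(\theta)$ are all maximized at the same $\theta$.

```python
# Sketch: L, a*L and ln(L) for an exponential(rate theta) sample all peak at
# the same theta on a grid (hypothetical data; a is an arbitrary constant).
import numpy as np

x = np.array([0.8, 1.3, 0.4, 2.1, 0.9])           # made-up positive observations
theta = np.linspace(0.1, 3.0, 291)                 # grid of rate values

L = np.array([np.prod(t * np.exp(-t * x)) for t in theta])
a = 7.0
print(theta[L.argmax()], theta[(a * L).argmax()], theta[np.log(L).argmax()])
# all three coincide, near the analytic MLE 1/mean(x) ≈ 0.91
```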

Turning to maximum a posteriori probability (MAP) estimation, $\theta$ is regarded as a realization of a random variable $\Theta$ with a priori density function $f_{\Theta}(\theta)$, the data $\mathbf x$ is regarded as a realization of a random variable $\mathbf X$, and the likelihood function is considered to be the value of the conditional density $f_{\mathbf X\mid \Theta}(\mathbf x\mid \Theta=\theta)$ of $\mathbf X$ conditioned on $\Theta = \theta$; said conditional density function being evaluated at $\mathbf x$. The a posteriori density of $\Theta$ is $$f_{\Theta\mid \mathbf X}(\theta \mid \mathbf x) = \frac{f_{\mathbf X\mid \Theta}(\mathbf x\mid \Theta=\theta)f_\Theta(\theta)}{f_{\mathbf X}(\mathbf x)} \tag{1}$$ in which we recognize the numerator as the joint density $f_{\mathbf X, \Theta}(\mathbf x, \theta)$ of the data and the parameter being estimated. The point $\theta_{\text{MAP}}$ where $f_{\Theta\mid \mathbf X}(\theta \mid \mathbf x)$ attains its maximum value is the MAP estimate of $\theta$, and, using the same arguments as in the previous paragraph, we see that we can ignore $[f_{\mathbf X}(\mathbf x)]^{-1}$ on the right side of $(1)$ as a multiplicative constant, just as we can ignore multiplicative constants in both $f_{\mathbf X\mid \Theta}(\mathbf x\mid \Theta=\theta)$ and in $f_\Theta(\theta)$. Similarly, when log-likelihoods are being used, we can ignore additive constants.
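A minimal grid sketch of the MAP point (my own illustration; the Beta(2,2) prior and the binomial data are hypothetical): maximizing the unnormalized product prior $\times$ likelihood locates the same $\theta_{\text{MAP}}$ as the normalized posterior, because dividing by $f_{\mathbf X}(\mathbf x)$ only rescales.

```python
# Sketch: the argmax of prior*likelihood equals the argmax of the normalized
# posterior; the marginal f_X(x) is just a multiplicative constant in theta.
import numpy as np
from scipy.stats import beta, binom

n, y = 10, 7                                   # hypothetical binomial data
theta = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(theta, 2, 2)                  # Beta(2, 2) prior density
lik = binom.pmf(y, n, theta)

unnormalized = prior * lik                     # numerator of Bayes' theorem
posterior = unnormalized / unnormalized.sum()  # grid normalization plays the role of f_X(x)

print(theta[unnormalized.argmax()], theta[posterior.argmax()])   # same MAP estimate
```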

Dilip Sarwate
  • This line of thinking can be done via Bayes too: whether you put $L$ or $aL$ into Bayes' theorem doesn't matter; the $a$ will cancel, so the posterior is the same. – kjetil b halvorsen Nov 20 '18 at 14:24
5

In layman's terms: you'll often look for the maximum of the likelihood, and $f(x)$ and $kf(x)$ share the same critical points.

Sergio
  • So do $f(x)$ and $f(x)+2$, but they would not be equivalent likelihood functions. – Henry May 13 '14 at 18:18
  • Please, as Alecos Papadopoulos writes in his answer, "the likelihood is first a joint probability density function". Because of the iid assumption for random samples, that joint function is a _product_ of simple density functions, so multiplicative factors do arise, addends do not. – Sergio May 13 '14 at 18:36
  • The joint function is such a product if and only if the data are independent. But MLE extends to dependent variables, so the product argument appears unconvincing. – whuber Feb 20 '17 at 20:32
1

I would suggest not losing sight of any constant terms in the likelihood function (i.e., terms that do not involve the parameters). In usual circumstances they do not affect the $\text{argmax}$ of the likelihood, as already mentioned. But:

There may be unusual circumstances when you will have to maximize the likelihood subject to a ceiling, and then you should "remember" to include any constants in the calculation of its value.

Also, you may be performing model selection tests for non-nested models, using the value of the likelihood in the process, and since the models are non-nested, the two likelihoods will have different constants.
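As a hedged sketch of this second point (my own example with made-up count data, not from the answer): the term $-\ln(x_i!)$ in the Poisson log-density is constant in $\lambda$, so dropping it never affects the Poisson MLE, but it is data-dependent and has no counterpart in a geometric model, so dropping it distorts any comparison of the two fitted models' log-likelihood values (or AICs).

```python
# Sketch: comparing non-nested models (Poisson vs. a geometric pmf p*(1-p)^x
# on 0, 1, 2, ...) requires the full log-likelihoods, constants included.
import numpy as np
from scipy.special import gammaln

x = np.array([0, 2, 1, 3, 0, 1, 2, 4, 1, 0])       # hypothetical count data
n = len(x)

lam_hat = x.mean()                                  # Poisson MLE
p_hat = 1.0 / (1.0 + x.mean())                      # geometric MLE

ll_pois_full = np.sum(x * np.log(lam_hat) - lam_hat - gammaln(x + 1))
ll_pois_kernel = np.sum(x * np.log(lam_hat) - lam_hat)   # -log(x_i!) terms dropped
ll_geom = n * np.log(p_hat) + x.sum() * np.log(1 - p_hat)

print(ll_pois_full - ll_geom)    # fair comparison of the two fitted models
print(ll_pois_kernel - ll_geom)  # shifted by sum(log(x_i!)); misleading across models
```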

Apart from these, the sentence

"Because the likelihood is only defined up to a multiplicative constant of proportionality (or an additive constant for the log-likelihood)"

is wrong, because the likelihood is first a joint probability density function, not just "any" objective function to be maximized.

Alecos Papadopoulos
  • Hmmm... When wearing a Bayesian hat, I always thought of the likelihood function as the _conditional_ density function of the data given the parameter and not as a _joint_ probability density function. The location of the maximum of the joint probability density of the data and the parameter (regarded as a function of the unknown parameter $\theta$; the data being fixed) gives the maximum _a posteriori_ probability (MAP) estimate of $\theta$, does it not? – Dilip Sarwate May 13 '14 at 16:21
  • @DilipSarwate No objection to that -but here too we are looking at a _density_, which must sum up to unity. Then, constants are indispensable for properly defining it, and so I still think that the expression "is only _defined_ up to a proportionality constant" is wrong... I guess it is more careless writing rather than anything else... I suspect the authors may have been "affected" by how we find the posterior density, i.e. by "ignoring" the constants in the product "conditional density $\times$ prior". – Alecos Papadopoulos May 13 '14 at 16:28
  • If it is wrong, you have to consult some actual definition ... the actual definitions I have seen include that term! Your point about model selection might be an argument that that definition is not useful... (definitions, per se, are not right/wrong, but rather useful/not useful). – kjetil b halvorsen May 13 '14 at 17:37
  • @kjetilbhalvorsen I have access to many books that define a density function. For a function to be treated as a density, it must integrate to unity. The likelihood function _is_ a density function, viewed as a function of the parameters. As a function of the parameters it does not necessarily integrate to unity over the parameter space. Still the sentence "it is _only_ defined up to etc." remains wrong, or at best, uselessly confusing. There is no need to write this sentence in order to "drop" this proportionality constant, which, in any case, may be risky, as I explained in my answer. – Alecos Papadopoulos May 13 '14 at 17:46
  • I think you need to be a bit more careful with the language. The likelihood is a function of the parameters for a fixed sample, but is equivalent to the joint density over the **sample space**. That is, $$L(\boldsymbol \theta \mid \boldsymbol x) = f(\boldsymbol x \mid \boldsymbol \theta).$$ This will integrate to $1$ over the sample space, but is not necessarily $1$ when integrated over the parameter space. When you say "the likelihood is a density, viewed as a function of the parameters," that makes it sound as if you mean "density with respect to the parameters," which it isn't. – heropup May 13 '14 at 19:09
  • @heropup I have already written that it doesn't necessarily integrate to unity over the parameter space, and so, immediately, it cannot be considered as a "density function" when it is viewed as a "function of the parameters". – Alecos Papadopoulos May 13 '14 at 19:13
  • Yes, I know. My point is that the phrase "The likelihood function is a density function, viewed as a function of the parameters" is itself confusing. It would be more precise to say something like, "The likelihood function is a function of the parameters for a fixed sample, and is equivalent (or proportional) to the joint density over the sample space." – heropup May 13 '14 at 19:15
  • @heropup Certainly, that would indeed be much more precise. – Alecos Papadopoulos May 13 '14 at 19:23
  • @heropup Your desired statement that "The likelihood function ... is equivalent (or proportional) to the joint density over the sample space" would indeed be much more precise but equally incorrect. The likelihood function is **neither equivalent nor proportional to the joint density** because the "coefficient of proportionality" is _not_ a constant (unless the prior distribution of the unknown parameter is uniformly distributed over an interval). The joint density is $L(x\mid \theta)f(\theta)$ where $L$ is the likelihood and $f(\theta)$ is the prior distribution of the parameter. – Dilip Sarwate May 13 '14 at 22:09
  • @DilipSarwate I see your point, but we're talking about different densities. You're talking about a joint density over the space of both the sample and the parameters. I'm talking about a conditional density over the sample space for a fixed (but unknown) set of parameters. That is after all how we construct a likelihood, such as in maximum likelihood estimation. If I say I have $n$ IID observations from an exponential distribution with unknown parameter $\lambda$, to get an MLE based on this sample, I don't need to impose a prior on $\lambda$ to write a likelihood. – heropup May 13 '14 at 22:46
  • Alecos Papadopoulos: I understand you do not like the standard definition, but it is still the standard definition. In my answer I explained the thinking behind that choice of definition! You bring to market another argument: the "constant of proportionality" might be of interest for model choice, like making AIC comparable across different model families. Somebody used that argument and asked for R's calculated likelihoods to incorporate those constants. That was rejected, because the R Gods (like B Ripley) don't believe in that argument. – kjetil b halvorsen May 14 '14 at 10:51
  • @kjetilbhalvorsen +1 for the "R Gods". – Alecos Papadopoulos May 14 '14 at 14:00