32

The likelihood can be defined in several ways, for instance:

  • the function $L$ defined on $\Theta\times{\cal X}$ which maps $(\theta,x)$ to $L(\theta \mid x)$, i.e. $L:\Theta\times{\cal X} \rightarrow \mathbb{R}$.

  • the random function $L(\cdot \mid X)$

  • we could also consider that the likelihood is only the "observed" likelihood $L(\cdot \mid x^{\text{obs}})$

  • in practice the likelihood carries information about $\theta$ only up to a multiplicative constant, hence we could consider the likelihood as an equivalence class of functions rather than as a single function

Another question arises when considering a change of parametrization: if $\phi=\theta^2$ is the new parametrization, we commonly denote by $L(\phi \mid x)$ the likelihood of $\phi$, and this is not the evaluation of the previous function $L(\cdot \mid x)$ at $\theta^2$ but at $\sqrt{\phi}$. This is an abusive but useful notation that could cause difficulties for beginners if it is not emphasized.
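For concreteness (a minimal worked example, assuming $\theta \geq 0$ so that $\theta \mapsto \theta^2$ is one-to-one):

$$L(\phi \mid x) := L\bigl(\sqrt{\phi} \mid x\bigr), \qquad \text{so } L(\phi = 4 \mid x) \text{ means } L(\theta = 2 \mid x), \text{ not } L(\theta = 4 \mid x).$$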

What is your favorite rigorous definition of the likelihood?

In addition, what do you call $L(\theta \mid x)$? I usually say something like "the likelihood on $\theta$ when $x$ is observed".

EDIT: In view of some comments below, I realize I should have specified the context. I consider a statistical model given by a parametric family $\{f(\cdot \mid \theta), \theta \in \Theta\}$ of densities with respect to some dominating measure, with each $f(\cdot \mid \theta)$ defined on the observation space ${\cal X}$. Hence we define $L(\theta \mid x)=f(x \mid \theta)$ and the question is "what is $L$?" (the question is not about a general definition of the likelihood).

gung - Reinstate Monica
Stéphane Laurent
  • (1) Because $\int L(\theta|x)dx = 1$ for all $\theta$, I believe even the constant in $L$ is defined. (2) If you think of parameters like $\phi$ and $\theta$ as merely being *coordinates* for a manifold of distributions, then change of parameterization has no intrinsic mathematical meaning; it's merely a change of description. (3) Native English speakers would more naturally say "likelihood *of* $\theta$" rather than "on." (4) The clause "when $x$ is observed" has philosophical difficulties, because most $x$ will never be observed. Why not just say "likelihood of $\theta$ given $x$"? – whuber Jun 02 '12 at 16:19
  • @whuber: For (1), I don't think the constant is well-defined. See ET Jaynes's book where he writes: "that a likelihood is not a probability because its normalization is arbitrary." – Neil G Jun 02 '12 at 17:31
  • You appear to be confusing two kinds of normalization, Neil: Jaynes was referring to normalization by integration over $\theta$, not $x$. – whuber Jun 02 '12 at 17:32
  • @whuber: Why does that matter? If it's scale-invariant, it should be scale-invariant integrating over either $x$ or $\theta$? Concretely, we could say that a Bernoulli random variable with bias $\theta$ induces a likelihood $L(\theta \mid x)=k(1-\theta)^{1-x}\theta^x$. $k$ doesn't matter since the ratio of likelihoods is always right. – Neil G Jun 02 '12 at 17:40
  • @Neil, if the *only* thing you use likelihoods for is MLE, then fine. But they are used for other things, such as computing the [Cramer-Rao bound](http://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound). The constant will be crucial in such applications. – whuber Jun 02 '12 at 23:37
  • @whuber: I don't think a scaling factor will matter for the Cramer-Rao bound because changing $k$ adds a constant amount to the log-likelihood, which then disappears when the partial derivative is taken. – Neil G Jun 03 '12 at 00:04
  • I agree with Neil; I do not see any application where the constant plays a role. – Stéphane Laurent Jun 03 '12 at 06:14
  • @whuber: (4) "given $x$" is not defined in non-Bayeian settings since $\theta$ is then a constant, not a random variable; (1) as shown by the Likelihood Principle, two proportional likelihoods bring the _same_ information about $\theta$ so I also believe that a likelihood _cannot_ be normalised. – Xi'an Jun 16 '12 at 08:22
  • @Xi'an thank you, but we are talking about different things here. I am neither treating $\theta$ as a random variable, nor am I suggesting the likelihood be integrated over $\theta$ ("normalized"). For each value of $\theta$, $L(x|\theta)$ is a probability distribution, period. Please see [this post](http://stats.stackexchange.com/a/29695) for a definition of $L$ that makes it *unique,* not just multiplicatively. If you do not define $L$ uniquely, then how can you possibly compare two optimal values of $L$ when comparing models? – whuber Jun 16 '12 at 11:21
  • I agree that the normalization constant should play a role when comparing non-nested models, but I am ignorant about such model comparisons. – Stéphane Laurent Jun 16 '12 at 13:27
  • @whuber: (1) The Cramer-Rao lower bound and Fisher information matrix do not depend on the normalising constant because of the log; (2) The likelihoods associated with a sample and with the corresponding sufficient statistic (assuming it exists) bring the same amount of information on $\theta$, but they are only proportional; (3) you should not confuse the likelihood (as a function of $\theta$ with $x$ fixed) with the density (as a function of $x$ with $\theta$ fixed) – Xi'an Jun 16 '12 at 19:43
  • @whuber: (4) None of the above and below discussion accounts for Birnbaum's [Likelihood Principle](http://bit.ly/y1gORL). – Xi'an Jun 16 '12 at 19:58
  • @Xi'an (1) is right; (2) appears irrelevant to my point; and (3) just flabbergasts me: where, in anything I have ever written on this site, do you see even a suggestion that I would take the likelihood to be a density over $\theta$? The usual setup doesn't even specify a sigma algebra for $\Omega$, so we can hardly get started with integration. I am trying hard to disabuse others of that notion! (4) On p. 19 it is clear that the proportionality extends *only* for a fixed $x$: that is, at the stage where an optimal $\theta$ is sought for a given $x$, there is *much* freedom to alter $L$. – whuber Jun 16 '12 at 20:07
  • @whuber: (3) _this is not what I meant_ so, pardon my French!, but the likelihood was introduced by Fisher as a function of $\theta$ indexed by $x$ exactly to distinguish it from the sampling density $f(x|\theta)$, which is a density in $x$ not in $\theta$. – Xi'an Jun 16 '12 at 20:14
  • @Xi'an: the likelihood principle is not a theorem; hence we cannot claim something is wrong or right in the name of the likelihood principle – Stéphane Laurent Jun 17 '12 at 04:44
  • In fact the definition is implicitly stated up to a constant if we do not attach importance to the dominating measure. – Stéphane Laurent Jun 17 '12 at 17:13
  • See the following paper for a very thorough, modern discussion: Bjørnstad, J. F. (1996). [On the Generalization of the Likelihood Function and the Likelihood Principle](http://www.jstor.org/stable/2291674). *Journal of the American Statistical Association* **91**: 791-806. – kjetil b halvorsen Jun 17 '12 at 23:24
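A quick check of the point made by Neil G and Xi'an in the comments above: for any constant $k>0$,

$$\frac{\partial}{\partial \theta} \log\bigl(k\,L(\theta \mid x)\bigr) = \frac{\partial}{\partial \theta}\bigl(\log k + \log L(\theta \mid x)\bigr) = \frac{\partial}{\partial \theta} \log L(\theta \mid x),$$

so the score, and with it the Fisher information and the Cramér-Rao bound, are unchanged by a multiplicative constant.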

3 Answers

13

Your third item is the one I have most often seen used as a rigorous definition.

The others are interesting too (+1). In particular the first is appealing, but since the sample size is not (yet) defined, it is harder to define the "from" set.

To me, the fundamental intuition of the likelihood is that it is a function of the model + its parameters, not a function of the random variables (also an important point for teaching purposes). So I would stick to the third definition.

The source of the abuse of notation is that the "from" set of the likelihood is implicit, which is usually not the case for well-defined functions. Here, the most rigorous approach is to realize that after the transformation, the likelihood relates to another model. It is equivalent to the first, but still another model. So the likelihood notation should show which model it refers to (by subscript or otherwise), as sketched below. I never do it of course, but for teaching, I might.
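For instance, one could write (a sketch; the subscripts $\mathcal M_1$, $\mathcal M_2$ are just one possible convention):

$$L_{\mathcal M_1}(\theta \mid x) = f(x \mid \theta), \qquad L_{\mathcal M_2}(\phi \mid x) = L_{\mathcal M_1}\bigl(\sqrt{\phi} \mid x\bigr),$$

where $\mathcal M_2$ denotes the model reparametrized by $\phi = \theta^2$ (with $\theta \geq 0$).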

Finally, to be consistent with my previous answers, I say the "likelihood of $\theta$" in your last formula.

gui11aume
  • Thanks. And what is your advice about the equality up to a multiplicative constant? – Stéphane Laurent Jun 02 '12 at 12:15
  • Personally I prefer to invoke it when needed rather than hard-code it in the definition. And I think that for model selection/comparison this 'up-to-a-multiplicative-constant' equality does not hold. – gui11aume Jun 02 '12 at 12:21
  • OK. Concerning the name, you could imagine we discuss the likelihoods $L(\theta\mid x_1)$ and $L(\theta\mid x_2)$ for two possible observations. In such a case, would you say "the likelihood of $\theta$ when $x_1$ is observed", or "the likelihood of $\theta$ for the observation $x_1$", or something else? – Stéphane Laurent Jun 02 '12 at 14:11
  • Oh, in that case I just say "given $x_1$". – gui11aume Jun 02 '12 at 14:20
  • I didn't understand the part about the change of parameters and $\sqrt{\phi}$. Could you please explain? or point to a reference? – Pardis Jun 02 '12 at 14:31
  • If you re-parametrize your model with $\phi = \theta^2$, you _actually_ compute the likelihood as a composition of functions $L(.|x) \circ g(.)$ where $g(\phi) = \sqrt{\phi}$. In this case $g$ goes from $\mathbb{R}^+$ to $\mathbb{R}$, so the set of definition (mentioned as the "from" set) of the likelihood is no longer the same. You could call the first function $L_1(.|x)$ and the second $L_2(.|x)$ because they are not the same functions. – gui11aume Jun 02 '12 at 14:57
  • How is the third definition rigorous? And what is the problem with the sample size not being defined? Since we say $P(x_1, x_2, \dotsc, x_n \mid \theta)$, which naturally brings into existence a corresponding sigma algebra for the sample space $\Omega^n$, why can't we have the parallel definition for likelihoods? – Neil G Jun 02 '12 at 16:13
8

I think I would call it something different. The likelihood is the probability density for the observed $x$ given the value of the parameter $\theta$, expressed as a function of $\theta$ for the given $x$. I don't share the view about the proportionality constant. I think that only comes into play because maximizing any monotonically increasing function of the likelihood gives the same solution for $\theta$. So you can maximize $cL(\theta \mid x)$ for $c>0$, or other increasing transformations such as $\log(L(\theta \mid x))$, which is commonly done.
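A minimal numerical sketch of that last point (the Bernoulli sample and the parameter grid below are illustrative choices, not part of the answer above):

```python
import numpy as np

# Bernoulli sample and a grid of candidate values for theta
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])
theta = np.linspace(0.01, 0.99, 981)

# Likelihood of theta given x, defined up to the choice of constant c
L = theta**x.sum() * (1 - theta)**(len(x) - x.sum())

# L, c*L, and log(L) all have the same maximizer (here 0.75)
for g in (L, 5.0 * L, np.log(L)):
    print(theta[np.argmax(g)])
```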

Macro
Michael R. Chernick
  • Not only in maximization: the up-to-proportionality also comes into play in the notion of the likelihood ratio, and in Bayes' formula in Bayesian statistics. – Stéphane Laurent Jun 02 '12 at 13:41
  • I thought someone might downvote my answer. But I think it is quite reasonable to define likelihood this way, as a definitive probability, without calling anything proportional to it a likelihood. @StéphaneLaurent, as to your comment about priors: if the function is integrable, it can be normalized to a density. The posterior is proportional to the likelihood times the prior. Since the posterior must be normalized by dividing by an integral, we might as well specify the prior to be the distribution. It is only in an extended sense that this gets applied to improper priors. – Michael R. Chernick Jun 02 '12 at 16:35
  • I'm not quite sure why someone would *downvote* this answer. It seems you are trying to respond more to the OP's second question than the first. Perhaps that was not entirely clear to other readers. Cheers. :) – cardinal Jun 02 '12 at 16:50
  • @Michael I don't see any need to downvote this answer either. Concerning noninformative priors (this is another discussion and) I intend to open a new discussion about this subject. I will not do it soon, because I am not at ease with English, and it is more difficult for me to write "philosophy" than mathematics. – Stéphane Laurent Jun 02 '12 at 17:04
  • @Stephane: If you'd like, please consider posting your other question directly in French. We have several native French speakers on this site that likely would help translate any passages you're unsure about. This includes a moderator and also an editor of one of the very top English-language statistics journals. I look forward to the question. – cardinal Jun 02 '12 at 17:35
  • Thank you cardinal. But writing in English is a good exercise for me. And I am motivated to do this exercise because I really like stats.stackexchange.com, so I prefer to do it :) – Stéphane Laurent Jun 02 '12 at 17:38
  • @Stephane: sounds good. I just wanted to make sure and mention that as an option because I didn't want to miss the potential for a very interesting question simply due to language concerns. (I should mention that I think you express yourself very well, so you should feel more confident than you appear to!) I find the wide range of backgrounds of the users of this site to be quite stimulating and refreshing. I, for one, am glad to see you participating so actively. Cheers. – cardinal Jun 02 '12 at 17:47
  • Again, I think the proportionality is inherent to the definition of the likelihood because, outside the Bayesian perspective, there is no unambiguous reference measure on the parameter space. – Xi'an Jun 16 '12 at 08:26
6

Here's an attempt at a rigorous mathematical definition:

Let $X: \Omega \to \mathbb R^n$ be a random vector which admits a density $f(x | \theta_0)$ with respect to some measure $\nu$ on $\mathbb R^n$, where $\{f(x|\theta): \theta \in \Theta\}$ is a family of densities on $\mathbb R^n$ with respect to $\nu$. Then, for any $x \in \mathbb R^n$, we define the likelihood function $L(\theta | x)$ to be $f(x | \theta)$; for clarity, for each $x$ we have $L_x : \Theta \to \mathbb R$. One can think of $x$ as a particular potential $x_{obs}$ and of $\theta_0$ as the "true" value of $\theta$.

A couple of observations about this definition:

  1. The definition is robust enough to handle discrete, continuous, and other sorts of families of distributions for $X$.
  2. We are defining the likelihood at the level of density functions instead of at the level of probability distributions/measures. The reason for this is that densities are not unique, and it turns out that this isn't a situation where one can pass to equivalence classes of densities and still be safe: different choices of densities lead to different MLEs in the continuous case (see the sketch after this list). However, in most cases there is a natural choice of family of densities that is desirable theoretically.
  3. I like this definition because it incorporates the random variables we are working with and, by design, since we have to assign them a distribution, we have also rigorously built in the notion of the "true but unknown" value of $\theta$, here denoted $\theta_0$. For me, as a student, the challenge of being rigorous about likelihood was always how to reconcile the real-world concepts of a "true" $\theta$ and an "observed" $x_{obs}$ with the mathematics; this was often not helped by instructors claiming that these concepts weren't formal but then turning around and using them formally when proving things! So we deal with them formally in this definition.
  4. EDIT: Of course, we are free to consider the usual random elements $L(\theta | X)$, $S(\theta | X)$ and $\mathcal I(\theta | X)$ under this definition, with no real problems with rigor as long as you are careful (or even if you aren't, if that level of rigor is not important to you).
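A minimal numerical sketch of the non-uniqueness issue in observation 2, using the $\mathcal U(0, \theta)$ family discussed in the comments below (the code and the two density versions are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 2.0, size=10)  # sample from U(0, theta) with theta = 2
m = x.max()

# Two versions of the U(0, theta) density, equal nu-almost everywhere
def f1(x, t):  # open interval: 0 < x < t
    return (1.0 / t) * ((0 < x) & (x < t))

def f2(x, t):  # closed interval: 0 <= x <= t
    return (1.0 / t) * ((0 <= x) & (x <= t))

# Likelihood evaluated at theta = max(x) under each version
print(np.prod(f1(x, m)))  # 0.0: under f1 the supremum is never attained
print(np.prod(f2(x, m)))  # m**(-10): under f2 the MLE is max(x)
```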
guy
  • Yes, I should have specified that I only consider statistical models given by a family $\{f(x|\theta): \theta \in \Theta\}$ of Radon-Nikodym derivatives ("densities") with respect to some dominating measure $\nu$, and my question is about the nature of $L$ defined by $L(\theta \mid x) = f(x \mid \theta)$. – Stéphane Laurent Jun 02 '12 at 16:33
  • I do not get this measure theoretic remark about different versions of densities: the likelihood is a function of $\theta$ for _the observed_ data $x_{obs}$. You cannot start changing the versions at $x_{obs}$ _once_ $x_{obs}$ is observed... In other words, the probability that $x_{obs}$ belongs to a set where the versions differ is zero, because this set has measure zero... – Xi'an Jun 16 '12 at 19:47
  • @Xi'an I think your definition is too limited to be generally useful. One uses likelihoods not only for estimation but also for theoretical considerations in which *all* possible outcomes have to be contemplated. That is why guy takes some care to assure there is some commonality among the densities within a family. – whuber Jun 16 '12 at 19:56
  • @whuber: I would be most interested in an example where the version of the _density function_ matters for estimation purposes... Even in a frequentist perspective. – Xi'an Jun 16 '12 at 20:09
  • @Xi'an It begins to sound like you, guy, myself, and perhaps others may be operating with sufficiently different assumptions about the statistical setting, the objectives, and even the definitions of mathematical objects like densities that it may take quite a lot of discussion to straighten things out, and then I (strongly) suspect we will find no fundamental disagreement. This is a conversation that belongs in chat, because the commenting mechanism is too confining, laborious, (and distracting) to be helpful. – whuber Jun 16 '12 at 20:17
  • @Xi'an The most elementary situation in which choice of density matters is the Uniform$(0 , \theta)$. Pick the wrong version and the MLE doesn't even exist, regardless of which $x_{obs}$ you end up with. I think that is the minimal example of why the choice of density issue is worth mentioning in a rigorous setting, but it isn't *that* big of a deal. – guy Jun 17 '12 at 15:46
  • Uh? Can you explain any further? The uniform $\mathcal{U}(0,\theta)$ density is one over $\theta$ almost everywhere on $(0,\theta)$. Thanks! – Xi'an Jun 17 '12 at 16:09
  • @Xi'an Let $X_1, ..., X_n$ be uniform on $(0, \theta)$. Consider two densities $f_1 (x) = \theta^{-1} I[0 < x < \theta]$ versus $f_2 (x) = \theta^{-1} I[0 \le x \le \theta]$. Both $f_1$ and $f_2$ are valid densities for $\mathcal U(0, \theta)$, but under $f_2$ the MLE exists and is equal to $\max X_i$ whereas under $f_1$ we have $\prod_j f_1 (x_j| \max x_i) = 0$, so that if you set $\hat \theta = \max X_i$ you end up with a likelihood of $0$, and in fact the MLE doesn't exist because $\sup_\theta \prod_j f_1(x_j | \theta)$ is not attained for any $\theta$. – guy Jun 17 '12 at 21:12
  • @guy: thanks, I did not know about this interesting counter-example. – Xi'an Jun 18 '12 at 06:00
  • @guy You said that $\sup_\theta \prod_j f_1(x_j| \theta)$ is not attained for any $\theta$. However, this supremum is attained at some point, as I show below: $$L_1(\theta;x) = \prod_{j=1}^n f_1(x_j|\theta) = \theta^{-n} \prod_{j=1}^n I\big(0 < x_j < \theta\big) = \theta^{-n}I\big(0< M < \theta\big),$$ where $M = \max \{x_1, \ldots, x_n\}$. I am assuming that $x_j > 0$ for all $j=1,\ldots,n$. It is simple to see that 1. $L_1(\theta;x) = 0$ if $0 < \theta \le M$, and 2. $L_1(\theta;x) = \theta^{-n}$ if $\theta > M$. – Alexandre Patriota Jan 02 '14 at 15:17
  • @guy: continuing... That is, $$L_1(\theta;x) \in \big[0,M^{-n}\big),$$ for all $\theta \in (0,\infty)$. We do not have a maximum value but the supremum does exist and it is given by $$\sup_{\theta \in (0,\infty)} L_1(\theta; x) = M^{-n}$$ and the argument is $$M = \arg\sup_{\theta \in (0,\infty)} L_1(\theta;x).$$ Perhaps the usual asymptotics do not apply here and some other tools should be employed. But the supremum of $L_1(\theta;x)$ does exist, or I missed some very basic concepts. – Alexandre Patriota Jan 02 '14 at 15:18
  • @AlexandrePatriota The supremum exists, obviously, but it is not attained by the function. I'm not sure what the notation $\arg \sup$ is supposed to mean - there is no argument of $L_1(\theta; x)$ which yields the $\sup$ because $L_1(\theta; M) = 0$. The MLE is defined as any $\hat \theta$ which attains the $\sup$ (typically) and no $\hat \theta$ attains the $\sup$ here. Obviously there are ways around it - the asymptotics we appeal to require that there *exists* a likelihood with such-and-such properties, and there does. It's just $L_2$ rather than $L_1$. – guy Jan 02 '14 at 19:42
  • @guy, You are right, under the typical definition of the MLE, it does not exist. I think I misread your post. The $\arg \sup$ from my previous post is ambiguous and must be properly defined. – Alexandre Patriota Jan 02 '14 at 20:51