5

I am doing the CS229: Machine Learning course from Stanford Engineering Everywhere. All through the first chapter he uses

$$L(\theta) = P(Y | X; \theta)$$

i.e. the likelihood of the parameter $\theta$ is given by the conditional probability of $Y$ given $X$.

Now in the second chapter, when talking about Gaussian Discriminant Analysis, suddenly and without any explanation the likelihood looks like this:

$$L(\theta) = P(Y \cap X; \theta)$$

What happened here? Which likelihood function is used when? I find the first likelihood a much more natural choice.

I am talking about page 10 of this script

fubal
  • **Conventions:** The notation $Y\cap X$ is quite unusual since $X$ and $Y$ are random variables. A more natural notation would be $P(Y,X|\theta)$. Note also that capital letters are usually reserved for random variables, whose realisations are denoted with lower-case letters. – Xi'an Oct 01 '15 at 16:49

3 Answers

5

The two likelihoods are related by the following equation: $$P(Y \cap X\,|\,\Theta) = P(Y\,|\,X,\Theta)P(X\,|\,\Theta)$$ So, the joint probability of $Y$ and $X$ has to account for two things:

  1. The probability of generating $Y$ given $X$ and $\Theta$
  2. The probability of generating $X$ given $\Theta$

$P(Y\,|\,X,\Theta)$ only accounts for (1), and would be preferred when you only care about predicting $Y$ when $X$ is known. The joint likelihood looks at the probability of generating both $X$ and $Y$ given the model parameter $\Theta$. This could be valuable if you want your model to predict $X$ as well as $Y$ given $X$. Put another way, $P(X\,|\,\Theta)$ is a way of measuring to what extent your model knows what kinds of $X$ are likely to occur in your dataset.
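To make the factorisation concrete, here is a minimal numeric sketch (not from the CS229 notes; the parameters are made up) using a toy one-dimensional GDA-style model with $y \sim \mathrm{Bernoulli}(\phi)$ and $x \mid y \sim \mathcal{N}(\mu_y, 1)$. The joint density is built generatively as $P(x\mid y)P(y)$, and then checked against the discriminative factorisation $P(y\mid x)P(x)$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for a toy 1-D GDA-style model (made up for illustration):
# y ~ Bernoulli(phi), x | y ~ N(mu_y, 1)
phi, mu0, mu1 = 0.3, -1.0, 2.0

def joint(x, y):
    # P(x, y | theta) via the generative factorisation P(x | y, theta) * P(y | theta)
    mu, prior = (mu1, phi) if y == 1 else (mu0, 1 - phi)
    return norm.pdf(x, loc=mu) * prior

def marginal_x(x):
    # P(x | theta): mixture of the two class-conditional densities
    return joint(x, 0) + joint(x, 1)

def conditional_y(y, x):
    # P(y | x, theta): class posterior by Bayes' rule
    return joint(x, y) / marginal_x(x)

x, y = 0.5, 1
# Same joint, written as P(y | x, theta) * P(x | theta):
print(np.isclose(joint(x, y), conditional_y(y, x) * marginal_x(x)))  # True
```

GDA maximises the product of both pieces over the training set, whereas a purely discriminative model such as logistic regression maximises only the conditional piece.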

Paul
4

Start with the general definition of likelihood. With a likelihood you are not really interested in probabilities, but in the likelihood of $\theta$ given your data. It is calculated from the probability of the data under a model with parameters $\theta$, i.e.

$$L(\theta|X) = \prod_i f_\theta(x_i)$$

Now, in your examples two different likelihoods are described. In the first case, you have a regression model of $Y$ conditional on $X$; in the second case, you have a joint likelihood of $X$ and $Y$ in a bivariate model. This is the same distinction as between conditional and joint probabilities: both are probabilities, so they share the same properties, but they describe different situations.
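As a small, hypothetical illustration of the product-of-densities definition above (data and grid are made up), here is a sketch that evaluates $\log L(\theta\mid X)$ for a Gaussian model with unknown mean and unit variance. In your setting, $f_\theta$ would be either the conditional density of $y_i$ given $x_i$ or the joint density of $(x_i, y_i)$, which is exactly the difference between the two likelihoods:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=100)  # observed data (simulated for the example)

def log_likelihood(theta, data):
    # log L(theta | X) = sum_i log f_theta(x_i), with f_theta the N(theta, 1) density
    return norm.logpdf(data, loc=theta, scale=1.0).sum()

thetas = np.linspace(0.0, 3.0, 301)
lls = [log_likelihood(t, x) for t in thetas]
print(thetas[np.argmax(lls)], x.mean())  # the maximiser is close to the sample mean, the MLE
```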

Tim
0

It just seems that in the second case, $X$ and $Y$ are both modelled jointly in a generative model, and you can write the joint likelihood as $P(X, Y | \theta)$.

For example, if you now assume $X$ and $Y$ are independent, the joint log-likelihood can be written as:

$$ \log L(\theta) = \log P(X |\theta) + \log P(Y|\theta) $$
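A minimal sketch of that special case (all numbers made up): if $X$ and $Y$ are modelled as independent samples sharing a mean parameter $\theta$, the joint log-likelihood is just the sum of the two marginal log-likelihoods. (In GDA they are of course not independent; this only illustrates the decomposition above.)

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta_true = 2.0
x = rng.normal(theta_true, 1.0, size=50)  # X sample
y = rng.normal(theta_true, 1.0, size=50)  # Y sample, independent of X in this toy model

def log_lik(theta):
    # log L(theta) = log P(X | theta) + log P(Y | theta) under independence
    return norm.logpdf(x, loc=theta).sum() + norm.logpdf(y, loc=theta).sum()

grid = np.linspace(0.0, 4.0, 401)
print(grid[np.argmax([log_lik(t) for t in grid])])  # close to theta_true
```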

Luca