
In probabilistic machine learning, the likelihood of the data is usually computed as the product of the likelihoods of the individual data points given the parameters $\theta$. In logistic regression, the likelihood of the data given the parameters $\theta$ is $$P(Y|X,\theta) = \prod_{i=1}^m\left[\frac{1}{1+e^{-x_i\theta}}\right]^{y_i}\left[1-\frac{1}{1+e^{-x_i\theta}}\right]^{1-y_i}$$ This is just the product of $m$ Bernoulli distributions (a small numerical sketch of this product appears after the questions below). I have two questions.

  1. What is the relation between a conditional density and the likelihood that is used in Bayes' theorem?
  2. Is the product of distributions in this setting (Bernoulli distributions) going to result in a valid conditional distribution?
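
For concreteness, here is a minimal numerical sketch of the product above (assuming NumPy; the design matrix `X`, responses `y`, and parameter `theta` are made up purely for illustration):

```python
import numpy as np

# Made-up illustration data: m = 5 observations, intercept plus one feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=5)])  # design matrix, shape (m, 2)
y = np.array([1, 0, 1, 1, 0])                          # binary responses y_i
theta = np.array([0.5, -1.0])                          # a fixed parameter value

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = sigmoid(X @ theta)  # P(Y_i = 1 | x_i, theta) for each i

# Likelihood: product of m Bernoulli pmfs, exactly the formula above.
likelihood = np.prod(p**y * (1 - p)**(1 - y))

# Numerically safer equivalent: sum of the Bernoulli log-pmfs.
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(likelihood, np.exp(log_likelihood))  # agree up to floating-point error
```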

EDIT

Regarding the first question, suppose I am able to define a joint distribution $P(Y,\theta|X)$. This assumes that $\theta$ is a random variable that follows some distribution. Then, given fixed data $Y$, we can slice the joint distribution at $Y=y$ and obtain the conditional density $P(\theta|Y=\text{data},X)$. Then $\arg\max_{\theta}P(\theta|Y=\text{data},X)$ represents the $\theta_{MLE}$. In this case the conditional probability is the likelihood function for $\theta$, and integrating it over $\theta$ yields a value of 1. However, if $\theta$ does not follow some distribution, then integrating this likelihood does not equal 1?

1 Answer


The entire (Bayesian and classical) analysis of a generalised linear model is conditional on the regressor vector $X=(x_1,\ldots,x_m)$.

The joint distribution of the $Y_i$'s in the logit model $$p(y|x,\theta) = \prod_{i=1}^m\left[\frac{1}{1+e^{-x_i\theta}}\right]^{y_i}\left[1-\frac{1}{1+e^{-x_i\theta}}\right]^{1-y_i}$$ where $$y=(y_1,\ldots,y_m)\quad\text{and}\quad x=(x_1,\ldots,x_m)$$ is a valid joint pmf [on the components of $Y$] conditional on the vector $X=(x_1,\ldots,x_m)$, assuming the $Y_i$'s are independent given $X$ and that $$\mathbb P(Y_i=1|X=x,\theta)=\frac{1}{1+e^{-x_i\theta}}$$ As a joint distribution, it defines a likelihood function $$\ell(\theta|X,Y)=\prod_{i=1}^m\left[\frac{1}{1+e^{-x_i\theta}}\right]^{y_i}\left[1-\frac{1}{1+e^{-x_i\theta}}\right]^{1-y_i}$$ that can be used in a Bayesian analysis.
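
To see that this is a valid joint pmf for fixed $x$ and $\theta$, one can sum it over all $2^m$ binary vectors $y$ and check that the total is one. A small sketch, using a hypothetical design matrix and parameter value:

```python
from itertools import product as binary_vectors

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fixed design matrix and parameter value.
rng = np.random.default_rng(0)
m = 5
X = np.column_stack([np.ones(m), rng.normal(size=m)])
theta = np.array([0.5, -1.0])
p = sigmoid(X @ theta)  # P(Y_i = 1 | x_i, theta)

# Sum the joint pmf p(y | x, theta) over all 2^m binary outcome vectors y.
total = 0.0
for y in binary_vectors([0, 1], repeat=m):
    y = np.array(y)
    total += np.prod(p**y * (1 - p)**(1 - y))

print(total)  # ~1.0: the product of Bernoulli pmfs is a valid joint pmf in y
```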

As a function of $\theta$, the likelihood is not a pdf: it does not integrate to one, except in some specific cases (not including the logit model). The same applies when, given a prior $\pi(\theta)$, one considers the product $\ell(\theta)\pi(\theta)$: it does not integrate to one. The joint distribution of $\theta$ and $Y$ is $$p(y|\theta)\pi(\theta)$$ which integrates to one in $(y,\theta)$, and the conditional distribution of $\theta$ given $Y=y$ (and $X$) is $$\dfrac{p(y|\theta)\pi(\theta)}{\int_\Theta p(y|\eta)\pi(\eta)\,\text{d}\eta}$$ which integrates to one in $\theta$. The marginal $$\int_\Theta p(y|\eta)\pi(\eta)\,\text{d}\eta$$ integrates to one in $y$ [except that the integral over $y$ is a summation, since $Y$ is discrete].
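
A crude numerical illustration of this point, for a hypothetical one-parameter logit model with a standard normal prior and grid quadrature: the likelihood alone does not integrate to one in $\theta$, while the normalised posterior does.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data with a single scalar parameter theta (no intercept).
x = np.array([-2.0, -0.5, 0.1, 1.0, 2.0])
y = np.array([1, 0, 1, 1, 0])

def likelihood(theta):
    p = sigmoid(theta * x)
    return np.prod(p**y * (1 - p)**(1 - y))

def prior(theta):
    return np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi)  # standard normal prior pi(theta)

# Riemann-sum quadrature over a wide, fine grid of theta values.
grid = np.linspace(-10.0, 10.0, 4001)
dt = grid[1] - grid[0]
lik = np.array([likelihood(t) for t in grid])

print(np.sum(lik) * dt)                    # generally not 1: the likelihood is not a density in theta

evidence = np.sum(lik * prior(grid)) * dt  # marginal likelihood p(y | x)
posterior = lik * prior(grid) / evidence   # p(theta | y, x)
print(np.sum(posterior) * dt)              # ~1: the posterior integrates to one in theta
```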

Xi'an
  • Consider the scenario I gave in the edit section, where $P(\theta|Y=\text{data},X)$ represents the likelihood. If I integrate over $d\theta$, should I get 1 because it is a marginal distribution? – calveeen Aug 08 '20 at 16:39
  • Why is $p(y|x,\theta)$ a joint distribution? I thought it is conditioned on $x$ and $\theta$? – calveeen Aug 08 '20 at 16:41
  • Because $\theta$ often is not a random variable, calveeen, there isn't any joint distribution to condition on. You need to distinguish conditioning of random variables from the presence of *parameters,* especially because they often employ similar (or identical) notation. – whuber Aug 08 '20 at 16:50
  • @whuber So $\theta$ is not a random variable when used in the likelihood in Bayes' theorem, $p(\text{data}|\theta)$? – calveeen Aug 08 '20 at 17:07
  • The likelihood function does not have a Bayesian meaning per se. Especially when considering the anti-Bayesian stance of its initiator, R.A. Fisher. – Xi'an Aug 08 '20 at 17:56