11

I have the model for multiclass logistic regression, which is given by

$$ P(Y=j|X^{(i)}) = \frac{\exp(\theta_j^TX^{(i)})}{1+ \sum_{m=1}^{k}\exp(\theta_m^T X^{(i)})} $$

where $k$ is the number of classes, $\theta_j$ is the parameter vector to be estimated for the $j$th class, and $X^{(i)}$ is the $i$th training example.

One thing I don't get is how the denominator $$ 1+ \sum_{m=1}^{k}\exp(\theta_m^T X^{(i)}) $$ normalizes the model, i.e., keeps the probability between 0 and 1.

I am used to binary logistic regression being

$$ P(Y=1|X^{(i)}) = \frac{1}{1 + \exp(-\theta^T X^{(i)})} $$

Actually, I am confused by the normalization. In the binary case the sigmoid function never lets the value fall below 0 or exceed 1, but I don't see why the same holds in the multiclass case. Why is that so?

This is my reference: https://list.scms.waikato.ac.nz/pipermail/wekalist/2005-February/029738.html. I would have expected the normalizing form to be $$ P(Y=j|X^{(i)}) = \frac{\exp(\theta_j^T X^{(i)})}{\sum_{m=1}^{k} \exp(\theta_m^T X^{(i)})} $$

cardinal
user34790
  • 2
    Hint: In logistic regression there are implicitly *two* probabilities to deal with: the probability $Y=1$ and the probability $Y=0$. Those probabilities must sum to $1$. – whuber Jul 05 '12 at 18:31
  • 1
    Based on some of your other posts, you know how to markup equations. The text equations here are difficult to read and the (subscripts?) are confusing - can you mark them up with $\LaTeX$? – Macro Jul 05 '12 at 18:32
  • 2
    Because you're posting so many questions here, please pause and read our FAQ about how to ask good questions. Read the help for $\TeX$ markup so you can make your equations readable. – whuber Jul 05 '12 at 18:32
  • I have edited the equation. @whuber Actually, I am confused about multiclass logistic regression, not the binary case. I don't see how adding all the elements in the denominator normalizes the probability. – user34790 Jul 05 '12 at 18:37
  • @user34790, when you divide each term by the sum, then individual class probabilities sum to 1. What is $X^{(i)}$ by the way? – Macro Jul 05 '12 at 18:40
  • @Macro it is the $i$th training example. I just don't see how dividing by the sum makes the class probabilities sum to 1. – user34790 Jul 05 '12 at 18:42
  • Have you checked the Wikipedia entry on [Multinomial logit](http://en.wikipedia.org/wiki/Multinomial_logit)? – assumednormal Jul 06 '12 at 10:39
  • I have tried to clean up some of the $\TeX$ markup. Please check that I have not inadvertently introduced errors to the *content* of the equations. Cheers. – cardinal Jul 06 '12 at 12:39
  • and I have made my answer consistent with the OP's notation. The @cardinal is indeed thorough... – conjugateprior Jul 06 '12 at 13:41

2 Answers

14

Your formula is wrong (in the upper limit of the sum). In logistic regression with $K$ classes ($K > 2$) you essentially create $K-1$ binary logistic regression models, where you choose one class as the reference or pivot. Usually, the last class $K$ is selected as the reference. Thus, the probability of the reference class can be calculated by $$P(y_i = K | x_i) = 1 - \sum_{k=1}^{K-1} P(y_i = k | x_i) .$$ The general form of the probability is $$P(y_i = k | x_i) = \frac{\exp(\theta_k^T x_i)}{\sum_{j=1}^K \exp(\theta_j^T x_i)} .$$ As the $K$-th class is your reference, $\theta_K = (0, \ldots, 0)^T$ and therefore $$\sum_{j=1}^K \exp(\theta_j^T x_i) = \exp(0) + \sum_{j=1}^{K-1} \exp(\theta_j^T x_i) = 1 + \sum_{j=1}^{K-1} \exp(\theta_j^T x_i) .$$ In the end you get the following formula for all $k < K$: $$ P(y_i = k | x_i) = \frac{\exp(\theta_k^T x_i)}{1 + \sum_{j=1}^{K-1} \exp(\theta_j^T x_i)} $$
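To see the normalization concretely, here is a minimal Python sketch (not from the answer itself; the feature vector and parameter values are made up for illustration). With $\theta_K$ fixed at zero, dividing each exponentiated linear predictor by the sum gives class probabilities that sum to 1, and the "$1 + \sum$" form gives exactly the same numbers because $\exp(0) = 1$.

```python
import numpy as np

# Hypothetical example: K = 3 classes, 2 features; the last class is the reference.
x = np.array([0.5, -1.2])                  # one observation x_i
thetas = np.array([[ 0.8,  0.3],           # theta_1
                   [-0.4,  1.1],           # theta_2
                   [ 0.0,  0.0]])          # theta_K = 0 (reference class)

scores = thetas @ x                        # linear predictors theta_k^T x_i
probs = np.exp(scores) / np.exp(scores).sum()   # divide each term by the sum

print(probs)                               # all values lie in (0, 1)
print(probs.sum())                         # 1.0 -- the division normalizes them

# Equivalent "1 + sum" form: exp(0) = 1 is the reference class's contribution.
denom = 1.0 + np.exp(scores[:-1]).sum()
probs_alt = np.append(np.exp(scores[:-1]), 1.0) / denom
print(np.allclose(probs, probs_alt))       # True
```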

sebp
  • 4
    note that the choice of reference class is not important, if you are doing maximum likelihood. But if you are doing penalised maximum likelihood, or bayesian inference, it can often be more useful to leave the probabilities over-parameterised, and let the penalty chose a way of handling the over-parameterisation. This is because most penalty functions/priors are not invariant with respect to the choice of reference class – probabilityislogic Jul 06 '12 at 12:25
  • @sebp, it seems that $i$ is a bit confusing; it would be better to use $i$ for observation, and some other letter for category $k$ iteration. – garej Jul 07 '17 at 07:09
5

I think you're being confused by a typo: your $k$ should be $k-1$ in the first equation. The 1's you see in the logistic case are actually $\exp(0)$'s, i.e., they come from the $k$th class having $\theta = 0$.

Assume that $\theta_1^T X = b$. Now notice that you can get from the last formulation to the logistic regression version as follows: $$ \frac{\exp(b)}{\exp(0)+\exp(b)} = \frac{\exp(0)}{\exp(0)+\exp(-b)} = \frac{1}{1+\exp(-b)} $$ For multiple classes, just replace the denominator in the first two quantities by a sum over the exponentiated linear predictors.
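A quick numerical check of this identity (a throwaway sketch; the value of $b$ is arbitrary):

```python
import numpy as np

b = 0.7  # arbitrary value for the linear predictor theta_1^T X

# Two-class "softmax" form, with the reference class's score fixed at 0 ...
softmax_form = np.exp(b) / (np.exp(0.0) + np.exp(b))

# ... equals the familiar sigmoid form of binary logistic regression.
sigmoid_form = 1.0 / (1.0 + np.exp(-b))

print(np.isclose(softmax_form, sigmoid_form))  # True
```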

cardinal
conjugateprior