
Is it correct to say that binary logistic regression is a special case of multinomial logistic regression when the outcome has 2 levels?

Franck Dernoncourt
nostock
  • For another answer, pretty much equivalent to the two answers here but with a different presentation: [Softmax vs Sigmoid function in Logistic classifier](http://stats.stackexchange.com/a/254071/12359) – Franck Dernoncourt Jan 02 '17 at 14:39

2 Answers


Short answer: Yes.

Longer answer:

Consider a dependent variable $y$ consisting of $J$ categories. A multinomial logit model would model the probability that $y$ falls in category $m$ as:

$ \mathrm{Pr}(y=m | x) = \frac{\exp(x\beta_m)}{\sum_{j=1}^J \exp(x\beta_j)} $

where $\beta_1 = 0$.

So if $y$ has three categories (1,2,3), you could get the three probabilities as:

$ \mathrm{Pr}(y=1 | x) = \frac{\exp(x \cdot 0)}{\exp(x \cdot 0) + \exp(x\beta_2) + \exp(x\beta_3)} = \frac{1}{1 + \exp(x\beta_2) + \exp(x\beta_3)} $

$ \mathrm{Pr}(y=2 | x) = \frac{\exp(x\beta_2)}{1 + \exp(x\beta_2) + \exp(x\beta_3)} $

$ \mathrm{Pr}(y=3 | x) = \frac{\exp(x\beta_3)}{1 + \exp(x\beta_2) + \exp(x\beta_3)} $

In your special case, where $y$ has two categories, this condenses to:

$ \mathrm{Pr}(y=1 | x) = \frac{1}{1 + \exp(x\beta_2) } $

$ \mathrm{Pr}(y=2 | x) = \frac{\exp(x\beta_2)}{1 + \exp(x\beta_2) } $

This is exactly a binary logistic regression.
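
A quick numerical check of this equivalence, as a minimal sketch in Python with NumPy (the input $x$ and coefficient $\beta_2$ are made-up illustrative values):

```python
import numpy as np

# Made-up illustrative values: a scalar input x and the free coefficient
# beta_2 (beta_1 is fixed at 0, as in the derivation above).
x = 1.5
beta_2 = 0.8

# Two-category multinomial logit probabilities with beta_1 = 0
denom = np.exp(0 * x) + np.exp(beta_2 * x)
pr_y1 = np.exp(0 * x) / denom
pr_y2 = np.exp(beta_2 * x) / denom

# Binary logistic regression gives the same Pr(y = 2)
sigmoid = 1 / (1 + np.exp(-beta_2 * x))

assert np.isclose(pr_y2, sigmoid)       # softmax with K = 2 is the sigmoid
assert np.isclose(pr_y1 + pr_y2, 1.0)   # probabilities sum to 1
```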

Maarten Buis
  • Thanks for the answer. It makes a lot of sense to me apart from one question: why can we assume $\beta_1$ is 0? – Allen Dec 05 '17 at 22:32
  • @Allen If there are 3 categories and we know two of these probabilities, then we also know the third, as these probabilities add up to 1. So with $M$ categories we cannot estimate $M$ probabilities but only $M-1$. This means we need a constraint on one set of $\beta$s. Setting $\beta_1=0$ is equivalent to saying $\Pr(y=1)= 1- \sum_{m=2}^M \Pr(y=m)$. – Maarten Buis Jan 03 '18 at 09:17

The accepted answer and the answer by Franck are both excellent, but I want to spell out the subtraction in $\beta$ in a little more detail, and I hope this answer is easy to read.

The general equation for the probability that the dependent variable $Y_i$ falls into category $c$ (out of $K$ categories), given the sample (observation) $X_i$, is:
$$\Pr(Y_i=c) = \frac{e^{\beta_c X_i}}{\sum^K_{k=1}e^{\beta_k X_i}}$$

Since $\sum^K_{k=1}\Pr(Y_i=k) = 1$ for every $Y_i$, any one probability $\Pr(Y_i=c)$ is determined by the remaining ones (those with $k\neq c$). As a result, only $K-1$ of the coefficient vectors $\beta_k$ are separately specifiable.

A useful property of the equation above is that the probabilities remain the same if we add the same constant vector $C$ to every $\beta_k$, because the factor $e^{CX_i}$ cancels. That is:

$$\frac{e^{(\beta_c+C)X_i}}{\sum^K_{k=1}e^{(\beta_k+C)X_i}}=\frac{e^{CX_i}e^{\beta_c X_i}}{e^{CX_i}\sum^K_{k=1}e^{\beta_k X_i}}=\frac{e^{\beta_c X_i}}{\sum^K_{k=1}e^{\beta_k X_i}}$$
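
This shift invariance is easy to verify numerically. Here is a minimal sketch in Python with NumPy; the input, coefficients, and constant are made-up illustrative values:

```python
import numpy as np

# Made-up illustrative values: a scalar input X_i, one coefficient per
# category (K = 3), and an arbitrary constant C.
x = 0.7
beta = np.array([0.5, -1.2, 2.0])
C = 0.37

def softmax_probs(beta, x):
    """Multinomial logit probabilities Pr(Y = k) for scalar x."""
    scores = np.exp(beta * x)
    return scores / scores.sum()

# Shifting every coefficient by the same constant leaves the probabilities
# unchanged: the factor exp(C*x) cancels from numerator and denominator.
assert np.allclose(softmax_probs(beta, x), softmax_probs(beta + C, x))
```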

It is then reasonable to set $C=-\beta_K$ (or $-\beta_k$ for any other $k$), so that the new coefficient vector for the $K$th category becomes $0$ and only the $K-1$ remaining categories have separately specifiable coefficients:

$\beta'_1=\beta_1-\beta_K$
$\;\;\vdots$
$\beta'_{K-1}=\beta_{K-1}-\beta_K$
$\beta'_K=\beta_K-\beta_K=0$

Hence we can rewrite the first general equation in the form:
$$\Pr(Y_i=c) = \frac{e^{\beta'_c X_i}}{e^{0\cdot X_i}+\sum^{K-1}_{k=1}e^{\beta'_k X_i}} = \frac{e^{\beta'_c X_i}}{1+\sum^{K-1}_{k=1}e^{\beta'_k X_i}}$$

When we have two categories, that is, when the dependent variable is binomial ($K=2$), and we take the $K$th category to be $Y_i=2$, we have:
$$\Pr(Y_i=1) = \frac{e^{\beta'_1 X_i}}{1+e^{\beta'_1 X_i}}=\frac{1}{1+e^{-\beta'_1 X_i}},$$ and $$\Pr(Y_i=2) = 1-\Pr(Y_i=1).$$

These are the familiar equations of binomial logistic regression.
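
As a final check, here is a minimal Python/NumPy sketch of the whole chain above, with made-up coefficients: the unconstrained two-category softmax, the subtraction $\beta'_1=\beta_1-\beta_2$, and the resulting sigmoid all yield the same probability.

```python
import numpy as np

# Made-up illustrative values for a two-category (K = 2) model.
x = -0.4
beta_1, beta_2 = 1.3, 0.2

# Unconstrained two-category softmax probability for Y_i = 1
pr_y1 = np.exp(beta_1 * x) / (np.exp(beta_1 * x) + np.exp(beta_2 * x))

# Reparameterize: beta'_1 = beta_1 - beta_2 (so beta'_2 = 0)
beta_p1 = beta_1 - beta_2

# Binary logistic regression (sigmoid) with the reparameterized coefficient
sigmoid = 1 / (1 + np.exp(-beta_p1 * x))

assert np.isclose(pr_y1, sigmoid)   # same probability either way
```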

Reference: https://en.wikipedia.org/wiki/Multinomial_logistic_regression

Lerner Zhang