
I am familiar with softmax regression being written as:

$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}$$ for the probability of the class $Y$ being $y$, given the observation $X=x$, using subscripts to denote selecting the $i$th column of a matrix or the $i$th element of a vector. That is the formulation used in this answer.

But when I look at other sources, e.g. Wikipedia and ufldl.stanford.edu, they use the formula:

$$P(Y=y\mid X=x)=\frac{e^{[Wx]_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}}$$

It seems to me that the bias term $b$ is clearly needed to handle the case of the classes not being balanced.

When we split the terms up: $$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}=\frac{e^{[Wx]_{y}}\,e^{b_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}\,e^{b_{i}}}$$ the $e^{b_{y}}$ factor also seems to correspond to the prior probability term in Bayes' theorem: $$P(Y=y\mid X=x)=\frac{P(X=x\mid Y=y)\,P(Y=y)}{\sum_{\forall i}P(X=x\mid Y=i)\,P(Y=i)}$$
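For concreteness, here is a minimal numerical sketch (in Python; the weights, the 90/10 prior, and the zero input are all made-up values) of how the $e^{b_{y}}$ factor acts like a class prior when the input carries no information:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical model: 2 classes, 3 features
W = np.array([[ 0.5, -1.0, 0.2],
              [-0.3,  0.8, 0.1]])
b = np.log(np.array([0.9, 0.1]))  # log of an assumed 90/10 class prior

# With an uninformative input x = 0, softmax(Wx + b) reduces to
# softmax(b), recovering the prior -- the role of P(Y=y) in Bayes' theorem.
x = np.zeros(3)
print(softmax(W @ x + b))  # -> [0.9, 0.1]
```

Without $b$, the same input would give a uniform $[0.5, 0.5]$, no matter how imbalanced the classes are.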

It seems to me that it is required, but maybe I am missing something. Why is it left out by so many sources?


1 Answer

If you use matrix notation, then

$$ \beta_0 + \beta_1 X_1 + \dots +\beta_k X_k $$

can be defined in terms of a design matrix that already contains a column of ones for the intercept:

$$ \mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{1,1} & \dots & x_{1,k} \\ 1 & x_{2,1} & \dots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,k} \end{array} \right] $$

so writing $\beta_0 + \dots$ is redundant.
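A minimal sketch of this equivalence (in Python; the shapes and random values are arbitrary): prepending a 1 to the input and absorbing $b$ as an extra column of $W$ reproduces $Wx+b$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # 4 classes, 3 features
b = rng.normal(size=4)       # explicit bias / intercept
x = rng.normal(size=3)

# Augment: a leading 1 in the input, the bias as the matching column of W
x_aug = np.concatenate(([1.0], x))
W_aug = np.column_stack((b, W))

# The explicit-bias and design-matrix forms give identical logits
print(np.allclose(W @ x + b, W_aug @ x_aug))  # True
```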

  • Indeed it can be, though I would have expected the source to mention this. Looking at the [source code attached to the ufldl page](https://github.com/amaas/stanford_dl_ex/blob/master/ex1/ex1c_softmax.m), I see it does indeed do just that. I guess my question thus becomes: Why is this considered a preferable way to write the expression? (Given it hides the similarities to Bayes) – Lyndon White Aug 21 '17 at 08:34
  • @LyndonWhite because it yields simpler notation? It is also more general, since you use one notation whether or not the intercept is used. Moreover, it does not matter for the definition of the model: the intercept is needed for the results to remove bias, but the model itself "doesn't care" about it. – Tim Aug 21 '17 at 08:45