
I am familiar with softmax regression being written as:

$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}$$ for the probability of the class $Y$ being $y$, given the observation $X=x$, using subscripts to denote selecting the $i$th column of a matrix or the $i$th element of a vector. That is the formulation used in this answer.

But when I look at other sources, e.g. Wikipedia and ufldl.stanford.edu, they use the formula:

$$P(Y=y\mid X=x)=\frac{e^{[Wx]_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}}$$

It seems to me that the bias term $b$ is clearly needed to handle the case of the classes not being balanced.

When we split the terms up: $$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}=\frac{e^{[Wx]_{y}}\,e^{b_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}\,e^{b_{i}}}$$ the $e^{b_{y}}$ factor also seems to correspond to the prior probability term in Bayes' theorem: $$P(Y=y\mid X=x)=\frac{P(X=x\mid Y=y)\,P(Y=y)}{\sum_{\forall i}P(X=x\mid Y=i)\,P(Y=i)}$$
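For concreteness, here is a minimal numerical sketch (in Python; the weights, the 90/10 prior, and the zero input are all made-up values) of how the $e^{b_{y}}$ factor acts like a class prior when the input carries no information:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical model: 2 classes, 3 features
W = np.array([[ 0.5, -1.0, 0.2],
              [-0.3,  0.8, 0.1]])
b = np.log(np.array([0.9, 0.1]))  # log of an assumed 90/10 class prior

# With an uninformative input x = 0, softmax(Wx + b) reduces to
# softmax(b), recovering the prior -- the role of P(Y=y) in Bayes' theorem.
x = np.zeros(3)
print(softmax(W @ x + b))  # -> [0.9, 0.1]
```

Without $b$, the same input would give a uniform $[0.5, 0.5]$, no matter how imbalanced the classes are.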

It seems to me that it is required, but maybe I am missing something. Why is it left out by so many sources?


1 Answer

If you use matrix notation, then

$$ \beta_0 + \beta_1 X_1 + \dots +\beta_k X_k $$

can be defined in terms of a design matrix that already contains a column of ones for the intercept:

$$ \mathbf{X} = \left[ \begin{array}{cccc} 1 & x_{1,1} & \dots & x_{1,k} \\ 1 & x_{2,1} & \dots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n,1} & \dots & x_{n,k} \end{array} \right] $$

so writing $\beta_0 + \dots$ is redundant.
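A minimal sketch of this equivalence (in Python; the shapes and random values are arbitrary): prepending a 1 to the input and absorbing $b$ as an extra column of $W$ reproduces $Wx+b$ exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # 4 classes, 3 features
b = rng.normal(size=4)       # explicit bias / intercept
x = rng.normal(size=3)

# Augment: a leading 1 in the input, the bias as the matching column of W
x_aug = np.concatenate(([1.0], x))
W_aug = np.column_stack((b, W))

# The explicit-bias and design-matrix forms give identical logits
print(np.allclose(W @ x + b, W_aug @ x_aug))  # True
```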

  • Indeed it can be, though I would have expected the source to mention this. Looking at the [source code attached to the ufldl page](https://github.com/amaas/stanford_dl_ex/blob/master/ex1/ex1c_softmax.m), I see it does indeed do just that. I guess my question thus becomes: Why is this considered a preferable way to write the expression? (Given it hides the similarities to Bayes) – Lyndon White Aug 21 '17 at 08:34
  • @LyndonWhite because it yields simpler notation? It is also more general, since you use one notation whether or not the intercept is used. Moreover, it does not matter for the definition of the model: the intercept is needed for the results to remove bias, but the model itself "doesn't care" about it. – Tim Aug 21 '17 at 08:45