I am familiar with softmax regression being written as:
$$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}$$ for the chance of the class of $Y$ being $y$, given observations of $X$ being $x$, using subscripts to denote selecting the $i$th column of a matrix or the $i$th element of a vector. That is the formulation used in this answer.
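For concreteness, here is a minimal NumPy sketch of that formulation (the function name `softmax_probs` and the shapes are my own choices, not from any particular source):

```python
import numpy as np

def softmax_probs(W, b, x):
    """Softmax regression with a bias term: P(Y = i | X = x) for every class i."""
    z = W @ x + b          # logits, one entry per class
    z = z - z.max()        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example: 3 classes, 2 features (made-up values).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = np.array([0.0, 1.0, -1.0])
x = np.array([0.5, -0.2])
p = softmax_probs(W, b, x)
print(p, p.sum())          # a probability vector summing to 1
```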
But when I look at other sources, e.g. Wikipedia and ufldl.stanford.edu,
they use the formula: $$P(Y=y\mid X=x)=\frac{e^{[Wx]_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}}$$
It seems to me that the bias term $b$ is clearly needed to handle the case of the classes not being balanced.
When we split the terms up: $$P(Y=y\mid X=x)=\frac{e^{[Wx+b]_{y}}}{\sum_{\forall i}e^{[Wx+b]_{i}}}=\frac{e^{[Wx]_{y}}\,e^{b_{y}}}{\sum_{\forall i}e^{[Wx]_{i}}\,e^{b_{i}}}$$ the bias factor $e^{b_{y}}$ would also seem to correspond to the prior probability term in Bayes' theorem: $$P(Y=y\mid X=x)=\frac{P(X=x\mid Y=y)\,P(Y=y)}{\sum_{\forall i}P(X=x\mid Y=i)\,P(Y=i)}$$
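To illustrate that analogy numerically, here is a tiny sketch (with made-up bias values): when the evidence term $Wx$ contributes nothing, the prediction collapses to $\operatorname{softmax}(b)$, which behaves like a prior over the classes.

```python
import numpy as np

# With x = 0 the Wx term vanishes, so the prediction is softmax(b) alone,
# playing the role of the prior P(Y = i) in the Bayes decomposition above.
b = np.array([0.0, 1.0, -1.0])
prior_like = np.exp(b) / np.exp(b).sum()
print(prior_like)  # [0.2447, 0.6652, 0.0900] -- unequal class weights
```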
The bias term seems required to me, but maybe I am missing something. Why is it left out in so many sources?