
I'm using softmax regression for a multi-class classification problem where the classes do not have equal prior probabilities.

I know from logistic regression (softmax regression with 2 classes) that the prior probabilities of the classes are implicitly absorbed into the bias (as $\log(p_0/p_1)$).

What I usually do is manually remove this term from the bias.
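
For concreteness, here is a rough sketch of that adjustment (using scikit-learn; the toy data and variable names are just for illustration, and the sign of the correction depends on which class you treat as positive):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data, purely for illustration.
rng = np.random.default_rng(0)
n0, n1 = 900, 100                        # class 0 is 9x more frequent
X = np.vstack([rng.normal(0.0, 1.0, (n0, 2)),
               rng.normal(1.0, 1.0, (n1, 2))])
y = np.array([0] * n0 + [1] * n1)

clf = LogisticRegression().fit(X, y)

# Empirical class priors.
p1 = y.mean()
p0 = 1.0 - p1

# Remove the log prior-odds term from the fitted intercept; the sign of the
# correction depends on which class is coded as "positive" in your setup.
adjusted_intercept = clf.intercept_ + np.log(p0 / p1)
print(clf.intercept_, adjusted_intercept)
```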

My question is: what is the corresponding term in the softmax regression bias?

Thanks.

Ran

1 Answer


As far as I'm aware, the justification for softmax bias initialization is a bit hand-wavy. Recall that softmax regression is maximum (log) likelihood estimation for $W,\textbf{b}$ under the model
$$
\DeclareMathOperator{cat}{Cat}
\newcommand{\norm}[1]{\left\| #1 \right\|}
\newcommand{vsigma}{{\boldsymbol\sigma}}
\newcommand{vx}{{\textbf{x}}}
\newcommand{vb}{{\textbf{b}}}
\newcommand{vz}{{\textbf{z}}}
y\sim\cat(\vsigma(W\vx+\vb)); \;\;\;\sigma_i(\vz)=\frac{\exp z_i}{\sum_j\exp z_j}.
$$

With bias initialization, our intention is to find a value of $\vb$ for which $p(\vx, y|W,\vb)\propto p(y|W,\vb,\vx)$ starts out high. Under the assumptions that we initialize $W$ with small near-zero values and that $y$ is a label in $[K]$, we have $W\vx\approx 0$, so
$$
\log p(y|W,\vb,\vx)=\sum_{k=1}^K 1_{y=k}\log \sigma_k(W\vx + \vb)\approx\log\sigma_y(\vb).
$$

Adding up the log-probabilities of all assumed-independent examples $\{(\vx_i,y_i)\}_{i=1}^n$, a good initialization for $\vb$ would maximize the total approximate data log-likelihood
$$
\newcommand{vc}{{\textbf{c}}}
\sum_{i=1}^n\log\sigma_{y_i}(\vb)=\sum_{i=1}^n b_{y_i}-n\log\sum_{k=1}^K\exp b_k.
$$

The gradient of the above with respect to $\vb$ is $\vc-n\vsigma(\vb)$, where $\vc\in\mathbb{N}^K$ is the vector of counts of each class. The function above is also concave; see the question here about smooth max for a proof.
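
As a quick numerical sanity check of this stationarity condition, here is a plain NumPy sketch (the class counts below are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Arbitrary class counts c (K = 4 classes, n examples total).
c = np.array([500.0, 300.0, 150.0, 50.0])
n = c.sum()

# Approximate data log-likelihood as a function of the bias b:
#   sum_i b_{y_i} - n * log(sum_k exp(b_k))  =  c . b - n * logsumexp(b)
def approx_loglik(b):
    return c @ b - n * np.log(np.exp(b - b.max()).sum()) - n * b.max()

# Candidate maximizer: b_k = log(c_k / n), i.e. the log class frequencies.
b_star = np.log(c / n)

# The gradient c - n * softmax(b) should be (numerically) zero at b_star.
print(c - n * softmax(b_star))           # ~[0, 0, 0, 0]

# And b_star should beat, e.g., a zero bias on the approximate log-likelihood.
print(approx_loglik(b_star), approx_loglik(np.zeros(4)))
```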

The two facts above imply that a maximum is attained whenever $\vsigma(\vb)=\vc/n$. This, in turn, suggests that a viable initialization for the $i$-th term $b_i$ of the bias $\vb$ is indeed $\log p_i$, where $p_i$ is the proportion of $i$-labelled examples in the training set (a.k.a. the marginal statistics). Note that you can add any constant to $\vb$ and obtain another likelihood-maximizing bias, since softmax is invariant to a common shift; however, a large scale would get in the way of learning $W$. The relationship with the logistic bias is not coincidental; this tutorial discusses the similarity.
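
In practice, then, the initialization just amounts to setting the bias to the log class frequencies. A minimal NumPy sketch (the `labels` array and helper name are placeholders):

```python
import numpy as np

def init_softmax_bias(labels, num_classes, eps=1e-12):
    """Return b with b_k = log(p_k), where p_k is the empirical frequency of class k."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    priors = counts / counts.sum()
    return np.log(priors + eps)          # eps guards against empty classes

# Example: heavily imbalanced labels over 3 classes.
labels = np.array([0] * 80 + [1] * 15 + [2] * 5)
b = init_softmax_bias(labels, num_classes=3)
print(b, np.exp(b))                      # exp(b) ~ the class priors, so softmax(b) recovers them
```

Conversely, to strip the prior's contribution from a trained bias (what the question describes doing in the two-class case), you would subtract $\log p_i$ from each $b_i$.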

VF1