
Apparently, the softmax function $\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n}{e^{x_j}}}$ is a generalization of the sigmoid function $\sigma(x_i) = \frac{1}{1+e^{-x_i}}$. As far as I understand, the sigmoid outputs the same result as the softmax in a binary classification problem. I've tried to prove this, but I failed:

$\text{softmax}(x_0) = \frac{e^{x_0}}{e^{x_0} + e^{x_1}} = \frac{1}{1+e^{x_1 - x_0 }} \neq \frac{1}{1+e^{-x_0 }} = \text{sigmoid}(x_0)$
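A quick numeric check (my own NumPy snippet; the input values are arbitrary) shows the same mismatch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

x = np.array([2.0, 1.0])   # logits x_0, x_1
print(softmax(x)[0])       # 0.7311 = 1 / (1 + e^{x_1 - x_0})
print(sigmoid(x[0]))       # 0.8808 = 1 / (1 + e^{-x_0}) -- a different number
```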

Do I misunderstand something? How can I prove that the sigmoid and the softmax behave equivalently in a binary classification problem?

null

1 Answer


They are, in fact, equivalent, in the sense that one can be transformed into the other.

Suppose that your data is represented by a vector $\boldsymbol{x}$ of arbitrary dimension, and you build a binary classifier for it using an affine transformation followed by a softmax:

\begin{equation} \begin{pmatrix} z_0 \\ z_1 \end{pmatrix} = \begin{pmatrix} \boldsymbol{w}_0^T \\ \boldsymbol{w}_1^T \end{pmatrix}\boldsymbol{x} + \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}, \end{equation} \begin{equation} P(C_i | \boldsymbol{x}) = \text{softmax}(z_i)=\frac{e^{z_i}}{e^{z_0}+e^{z_1}}, \, \, i \in \{0,1\}. \end{equation}
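For concreteness, here is a minimal NumPy sketch of such a classifier (the dimension, weights, and input are arbitrary values chosen for illustration):

```python
import numpy as np

x = np.array([0.5, -1.2, 0.3, 2.0])        # input vector x (arbitrary, dimension 4)
W = np.array([[ 0.1, -0.4, 0.8,  0.2],     # w_0^T
              [-0.3,  0.5, 0.1, -0.6]])    # w_1^T
b = np.array([0.05, -0.10])                # biases b_0, b_1

z = W @ x + b                              # logits z_0, z_1
p = np.exp(z) / np.exp(z).sum()            # softmax: P(C_0|x), P(C_1|x)
print(p, p.sum())                          # the two probabilities sum to 1
```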

Let's transform it into an equivalent binary classifier that uses a sigmoid instead of the softmax. First, we have to decide which probability we want the sigmoid to output (it can be the probability of class $C_0$ or of class $C_1$). This choice is completely arbitrary, so I choose class $C_0$. The classifier will then have the form:

\begin{equation} z' = \boldsymbol{w}'^T \boldsymbol{x} + b', \end{equation} \begin{equation} P(C_0 | \boldsymbol{x}) = \sigma(z')=\frac{1}{1+e^{-z'}}, \end{equation} \begin{equation} P(C_1 | \boldsymbol{x}) = 1-\sigma(z'). \end{equation}
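As a self-contained sketch of this form (the values of $\boldsymbol{w}'$ and $b'$ below are placeholders; the correct choice is derived next):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([0.5, -1.2, 0.3, 2.0])          # same input as above
w_prime = np.array([0.4, -0.9, 0.7, 0.8])    # placeholder weights (pinned down below)
b_prime = 0.15                               # placeholder bias

z_prime = w_prime @ x + b_prime
p0 = sigmoid(z_prime)   # P(C_0 | x)
p1 = 1.0 - p0           # P(C_1 | x); the two sum to 1 by construction
```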

The classifiers are equivalent if the probabilities are the same, so we must impose:

\begin{equation} \sigma(z') = \text{softmax}(z_0) \end{equation}
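Dividing the numerator and denominator of the softmax by $e^{z_0}$ makes the connection explicit:

\begin{equation} \text{softmax}(z_0) = \frac{e^{z_0}}{e^{z_0}+e^{z_1}} = \frac{1}{1+e^{-(z_0-z_1)}} = \sigma(z_0 - z_1), \end{equation}

so the equality holds exactly when $z' = z_0 - z_1$ for every $\boldsymbol{x}$.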

Replacing $z_0$, $z_1$ and $z'$ by their expressions in terms of $\boldsymbol{w}_0,\boldsymbol{w}_1, \boldsymbol{w}', b_0, b_1, b'$ and $\boldsymbol{x}$ and doing some straightforward algebraic manipulation, you may verify that the equality above holds if and only if $\boldsymbol{w}'$ and $b'$ are given by:

\begin{equation} \boldsymbol{w}' = \boldsymbol{w}_0-\boldsymbol{w}_1, \end{equation} \begin{equation} b' = b_0-b_1. \end{equation}
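A quick NumPy check (reusing the arbitrary values from the sketch above) confirms that the two classifiers then assign identical probabilities:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([0.5, -1.2, 0.3, 2.0])
W = np.array([[ 0.1, -0.4, 0.8,  0.2],
              [-0.3,  0.5, 0.1, -0.6]])
b = np.array([0.05, -0.10])

# Softmax classifier
z = W @ x + b
p_softmax = np.exp(z) / np.exp(z).sum()

# Equivalent sigmoid classifier: w' = w_0 - w_1, b' = b_0 - b_1
w_prime = W[0] - W[1]
b_prime = b[0] - b[1]
p0 = sigmoid(w_prime @ x + b_prime)

print(p_softmax[0], p0)                 # identical values
assert np.isclose(p_softmax[0], p0)
assert np.isclose(p_softmax[1], 1.0 - p0)
```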

learner
  • Thanks. If they are equivalent, why doesn't my approach work? I cannot prove the equality: when feeding softmax and sigmoid the same binary input data, they return different results. How can they be equivalent? – null Jun 24 '17 at 17:55
  • I moved the discussion to the topic above (https://stats.stackexchange.com/questions/233658/softmax-vs-sigmoid-function-in-logistic-classifier/287294#287294), since this is in fact a duplicate. Please check my comments there. – learner Jun 25 '17 at 23:21
  • See the link above for additional explanations that may be very helpful in understanding the idea behind the transformation. – null Jun 27 '17 at 11:06