
I am experimenting with a neural network model I found on Kaggle for the Titanic dataset, where the task is to determine whether a passenger survived or not.

The input I am providing is of this type:


     Pclass        Age     Fare  Sex_female  Sex_male
789       1  46.000000  79.2000           0         1
543       2  32.000000  26.0000           0         1
109       3  29.699118  24.1500           1         0
111       3  14.500000  14.4542           1         0
726       2  30.000000  21.0000           1         0
..      ...        ...      ...         ...       ...
559       3  36.000000  17.4000           1         0
648       3  29.699118   7.5500           0         1
556       1  48.000000  39.6000           1         0
302       3  19.000000   0.0000           0         1
786       3  18.000000   7.4958           1         0
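
(For context, a frame like this can be obtained roughly like so — a sketch, not my exact code; the file name, the mean-imputation of Age, and the dummy-coding of Sex are assumptions, though the repeated 29.699118 looks like the mean age.)

import pandas as pd

# Sketch of preprocessing that yields a frame like the one above
# (assumes the standard Kaggle train.csv; not my exact code).
df = pd.read_csv('train.csv')
df['Age'] = df['Age'].fillna(df['Age'].mean())   # 29.699118 looks like mean-imputed Age
X = pd.get_dummies(df[['Pclass', 'Age', 'Fare', 'Sex']], columns=['Sex'])
y = df['Survived']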

This is the model I found (I changed the input_shape, but everything else is the same):

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(units=32, input_shape=(5,), activation='relu'))
model.add(Dense(units=64, activation='relu', kernel_initializer='he_normal', use_bias=False))
model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(units=128, activation='relu', kernel_initializer='he_normal', use_bias=False))
model.add(Dropout(0.1))
model.add(Dense(units=64, activation='relu', kernel_initializer='he_normal', use_bias=False))
model.add(Dropout(0.1))
model.add(Dense(units=32, activation='relu'))
model.add(Dropout(0.15))
model.add(Dense(units=16, activation='relu'))
model.add(Dense(units=8, activation='relu', kernel_initializer='he_normal', use_bias=False))
model.add(Dense(units=1, activation='sigmoid'))

I used the following:

from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(lr=0.1),
              loss='binary_crossentropy',
              metrics=['binary_accuracy'])

The doubt I had was this: when I replace the sigmoid with softmax, the accuracy takes about a 30% hit. As far as my understanding goes, isn't softmax better when there are 2 classes, such as survived or not in this case, where each row can fall into only one class, not both?
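
(For clarity, the softmax version I'm comparing against looks roughly like this — a sketch of one standard way to do the swap, with two output units and sparse categorical cross-entropy so the 0/1 labels still work; I haven't shown my exact code.)

# Sketch of the softmax variant: same hidden layers as above, but the head
# has two units with softmax, and the loss changes accordingly.
softmax_model = Sequential()
softmax_model.add(Dense(units=32, input_shape=(5,), activation='relu'))
# ... same hidden layers as the sigmoid model above ...
softmax_model.add(Dense(units=2, activation='softmax'))   # instead of Dense(units=1, activation='sigmoid')

softmax_model.compile(optimizer=SGD(lr=0.1),
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])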

1 Answer

I think the phenomenon you're observing is just a consequence of the optimization procedure terminating prematurely: the softmax network has not reached a parameter configuration equivalent to the sigmoid network's. The sigmoid network attains a lower loss, so we know the softmax network isn't done training. We can prove that such a parameter configuration must exist, because a sigmoid network is a special case of a softmax network.

Since the two networks are alternative, equivalent representations of the same function, just use the sigmoid network.

You can get the softmax network to match the sigmoid network by using all of the usual tricks: train for more epochs, schedule the learning rate, adjust the momentum parameters, etc. But this will, at best, just match the sigmoid network.
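
For example, something along these lines (just a sketch; the epoch count, momentum, and schedule values are arbitrary illustrations, and X_train/y_train/X_val/y_val stand in for your own splits):

from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Assuming the two-unit softmax head (so sparse categorical cross-entropy
# with 0/1 integer labels): train longer, add momentum, and halve the
# learning rate whenever the validation loss stalls.
model.compile(optimizer=SGD(learning_rate=0.1, momentum=0.9),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=500,
          callbacks=[ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=20)])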


Proof:

Suppose you have a network $f(x)$ which has two output neurons $f(x)_1, f(x)_2$. These outputs are unconstrained real values, such as the outputs of a linear layer.

You can apply the softmax function to them to yield a probability vector: $$ \hat{p}_i = \frac{\exp(f(x)_i)}{\displaystyle \sum_j \exp(f(x)_j)} $$

Now suppose we use a linear layer and then apply softmax. In general, this looks like

$$ \hat{p}_i= \frac{\exp(a_i f(x)_i + b_i)}{\displaystyle \sum_j \exp(a_j f(x)_j + b_j)} $$

But for a specific choice of $a_j, b_j$, we can recover the sigmoid activation exactly.

$$\begin{align} \hat{p}_1 &= \frac{\exp( f(x)_1 + 0)}{\exp( f(x)_1 + 0) + \exp(0 f(x)_2 + 0)} \\[1em] &= \frac{\exp(f(x)_1)}{\exp(f(x)_1) + 1} \\[1em] &= \sigma(f(x)_1) \end{align} $$

and of course we now know $\hat{p}_1 = 1 - \hat{p}_2$ because the two probabilities sum to 1.
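
You can check this identity numerically (a quick NumPy sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract the max for numerical stability
    return e / e.sum()

z = 0.7                                   # any logit
p = softmax(np.array([z, 0.0]))           # two-class softmax with the second logit fixed at 0
print(p[0], sigmoid(z))                   # identical: softmax([z, 0])[0] == sigmoid(z)
print(p.sum())                            # the two probabilities sum to 1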


> isn't softmax better when there are 2 classes

The proof shows that in the case of 2 classes, the sigmoid network is a special case of the softmax network. When both networks attain the same loss, one isn't "better" than the other; they're simply tied. Your experiments have shown that a sigmoid network can be "better" in the sense that it has a lower loss and a higher accuracy than the softmax network when trained for the same number of iterations, but this is purely an artifact of not training the softmax network to an optimum. If you were using a simple regression model, you wouldn't stop part-way to the solution; so stopping the softmax model early isn't a fair comparison.

Sycorax
  • Is there any case where softmax outperforms sigmoid (in terms of accuracy)? Or is it that, for any case, the best a softmax network can do is match sigmoid? I ask because [this answer](https://stats.stackexchange.com/a/410112/312922) says to use softmax when each row can only belong to one class. Yet it did not perform well. So is there any other characteristic of a problem that can help determine sigmoid or softmax? – twothreezarsix Mar 01 '21 at 19:07
  • As I wrote in my answer, softmax would work fine if you trained the network longer. I even proved that the sigmoid is a special case of the softmax when you have 2 classes. One reason to use softmax is if you have 3 or more mutually exclusive classes. The answer you link to is comparing softmax in the case of mutually exclusive classes to the case of non-mutually exclusive classes. You've already established that your problem has mutually exclusive classes, so that comparison isn't relevant to your problem. – Sycorax Mar 01 '21 at 19:19