
Assume I have a data point $\mathbf{x} = [x_1, x_2, \ldots, x_D]^\top$ which I want to classify into one of two mutually exclusive categories $\mathcal{C}_0$ and $\mathcal{C}_1$. I can create a simple neural network with $D+1$ parameters and train with sigmoid cross entropy loss: $$ \hat{y}_1 = \sigma(\mathbf{w}_1^\top\mathbf{x} + b_1) \tag{1}\label{1} $$ where $\mathbf{w}_1 \in \mathbb{R}^D$ and $b_1 \in \mathbb{R}$ and $\sigma(z) = 1/(1 + \exp(-z))$. The label here would be a scalar $0$ or $1$.

Or I could create a network with $2D+2$ parameters and train with softmax cross entropy loss: $$ \mathbf{\hat{y}}_2 = \mbox{softmax}(\mathbf{W}_2\mathbf{x} + \mathbf{b}_2) \tag{2}\label{2} $$ where $\mathbf{W}_2 \in \mathbb{R}^{2 \times D}$ and $\mathbf{b}_2 \in \mathbb{R}^2$. Here, the entries of $\mathbf{\hat{y}}_2$ sum to $1$ because of the softmax. The label here would be either the one-hot vector $[1, 0]^\top$ or $[0, 1]^\top$.
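
To make the two setups concrete, here is a minimal NumPy sketch of both parameterizations (the variable names and the random data are only for illustration, not taken from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
x = rng.normal(size=D)          # one data point

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

# Equation (1): D + 1 parameters, a single probability for class C_1
w1 = rng.normal(size=D)
b1 = rng.normal()
y1_hat = sigmoid(w1 @ x + b1)

# Equation (2): 2D + 2 parameters, two probabilities that sum to 1
W2 = rng.normal(size=(2, D))
b2 = rng.normal(size=2)
y2_hat = softmax(W2 @ x + b2)

print(y1_hat)                   # scalar in (0, 1)
print(y2_hat, y2_hat.sum())     # length-2 vector, sums to 1
```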

Questions:

  1. Does it ever make sense to use the form in equation $\eqref{2}$ over equation $\eqref{1}$ given that it has twice the number of parameters?
  2. When we use softmax and go from 2-class classification to 3-class classification, we increase the number of parameters from $D+1$ to $3D + 3$. However, intuitively it seems that if we can do 2-class classification with $D+1$ parameters, we should be able to do 3-class with $2D + 2$. Is this intuition correct, and if so, is this normally done in practice?
  • Your question is related to [my question here](https://stats.stackexchange.com/questions/501683/overparameterization-with-softmax-with-neural-networks). As far as my arguments go, there is no reason to use the overparameterized softmax function. – Benjamin Christoffersen Dec 21 '20 at 05:46

1 Answer

  1. Does it ever make sense to use the form in equation (2) over equation (1) given that it has twice the number of parameters?

As far as I can see, there are no arguments for doing this (see also the question linked in the comment above). With equation (2) you get:

  • an infinite number of solutions with almost all neural network architectures.
  • at worst (the binary case, and with some architectures) twice the number of parameters, half of which are redundant (see the sketch after this list).
  • a greater number of FLOPs required to evaluate the loss and the gradient, and greater storage requirements for your tape when using automatic differentiation.
  • a singular Hessian of the loss function, which can cause problems with some optimization methods.
  • possibly slower convergence (more loss and gradient evaluations are required).
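
To illustrate the first two points, here is a small NumPy sketch (purely illustrative; the shift vector `c` and scalar `d` are arbitrary): adding the same shift to both rows of $\mathbf{W}_2$ and both entries of $\mathbf{b}_2$ changes every logit by the same amount, which the softmax cancels, so infinitely many parameter settings produce exactly the same model.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
x = rng.normal(size=D)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W2 = rng.normal(size=(2, D))
b2 = rng.normal(size=2)

# Shift both rows of W2 by the same vector c and both entries of b2 by the
# same scalar d: every logit changes by c @ x + d, which the softmax cancels.
c = rng.normal(size=D)
d = rng.normal()
W2_shifted = W2 + c             # broadcasts c onto each row
b2_shifted = b2 + d

print(softmax(W2 @ x + b2))
print(softmax(W2_shifted @ x + b2_shifted))   # identical probabilities
```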
  2. When we use softmax and go from 2-class classification to 3-class classification, we increase the number of parameters from $D+1$ to $3D+3$. However, intuitively it seems that if we can do 2-class classification with $D+1$ parameters, we should be able to do 3-class with $2D+2$. Is this intuition correct, and if so, is this normally done in practice?

Your intuition is correct. You can make do with $2D+2$ parameters by fixing one of the softmax arguments at zero; the reason this works is the sum-to-one constraint. Doing this yields the sigmoid function (the logit link) as a special case.
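
Here is a small NumPy sketch of that special case (illustrative names only): with the second logit pinned at zero, the first softmax output is exactly $\sigma(\mathbf{w}^\top\mathbf{x} + b)$.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
x = rng.normal(size=D)
w = rng.normal(size=D)
b = rng.normal()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two-class softmax with the second logit fixed at zero: only D + 1 free parameters
logits = np.array([w @ x + b, 0.0])
p = softmax(logits)

print(p[0])                   # probability of the first class
print(sigmoid(w @ x + b))     # identical: the sigmoid is the constrained special case
```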

I am not too familiar with what is done in practice, but it does seem that some use the overparameterized $3D+3$ specification for reasons that are not clear to me, so I cannot say what is normally done in the field.