
Assume I have a data point $\mathbf{x} = [x_1, x_2, \ldots, x_D]^\top$ which I want to classify into one of two mutually exclusive categories $\mathcal{C}_0$ and $\mathcal{C}_1$. I can create a simple neural network with $D+1$ parameters and train with sigmoid cross entropy loss: $$ \hat{y}_1 = \sigma(\mathbf{w}_1^\top\mathbf{x} + b_1) \tag{1}\label{1} $$ where $\mathbf{w}_1 \in \mathbb{R}^D$ and $b_1 \in \mathbb{R}$ and $\sigma(z) = 1/(1 + \exp(-z))$. The label here would be a scalar $0$ or $1$.

Or I could create a network with $2D+2$ parameters and train with softmax cross entropy loss: $$ \mathbf{\hat{y}}_2 = \mbox{softmax}(\mathbf{W}_2\mathbf{x} + \mathbf{b}_2) \tag{2}\label{2} $$ where $\mathbf{W}_2 \in \mathbb{R}^{2 \times D}$ and $\mathbf{b}_2 \in \mathbb{R}^2$. Here, the entries of $\mathbf{\hat{y}}_2$ sum to $1$ because of the softmax. The label here would be either the one-hot vector $[1, 0]^\top$ or $[0, 1]^\top$.
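
To make the two setups concrete, here is a minimal NumPy sketch of both parameterizations (the variable names and the random data are only for illustration, not taken from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
x = rng.normal(size=D)          # one data point

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())     # subtract max for numerical stability
    return e / e.sum()

# Equation (1): D + 1 parameters, a single probability for class C_1
w1 = rng.normal(size=D)
b1 = rng.normal()
y1_hat = sigmoid(w1 @ x + b1)

# Equation (2): 2D + 2 parameters, two probabilities that sum to 1
W2 = rng.normal(size=(2, D))
b2 = rng.normal(size=2)
y2_hat = softmax(W2 @ x + b2)

print(y1_hat)                   # scalar in (0, 1)
print(y2_hat, y2_hat.sum())     # length-2 vector, sums to 1
```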

Questions:

  1. Does it ever make sense to use the form in equation $\eqref{2}$ over equation $\eqref{1}$ given that it has twice the number of parameters?
  2. When we use softmax and go from 2-class classification to 3-class classification, we increase the number of parameters from $D+1$ to $3D + 3$. However, intuitively it seems that if we can do 2-class classification with $D+1$ parameters, we should be able to do 3-class with $2D + 2$. Is this intuition correct, and if so, is this normally done in practice?
  • Your question is related to [my question here](https://stats.stackexchange.com/questions/501683/overparameterization-with-softmax-with-neural-networks). As far as my arguments go, there is no reason to use the overparameterized softmax function. – Benjamin Christoffersen Dec 21 '20 at 05:46

1 Answer

  1. Does it ever make sense to use the form in equation (2) over equation (1) given that it has twice the number of parameters?

As far as I can see, there are no arguments for doing this (see also the question linked in the comment above). With equation (2) you get:

  • an infinite number of solutions with almost all neural network architectures.
  • at worst (the binary case, and with some architectures) twice the number of parameters, half of which are redundant (see the sketch after this list).
  • a greater number of FLOPs required to evaluate the loss and the gradient, and greater storage requirements for your tape when using automatic differentiation.
  • a singular Hessian of the loss function, which can cause problems with some optimization methods.
  • possibly slower convergence (more loss and gradient evaluations are required).
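
To illustrate the first two points, here is a small NumPy sketch (purely illustrative; the shift vector `c` and scalar `d` are arbitrary): adding the same shift to both rows of $\mathbf{W}_2$ and both entries of $\mathbf{b}_2$ changes every logit by the same amount, which the softmax cancels, so infinitely many parameter settings produce exactly the same model.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
x = rng.normal(size=D)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W2 = rng.normal(size=(2, D))
b2 = rng.normal(size=2)

# Shift both rows of W2 by the same vector c and both entries of b2 by the
# same scalar d: every logit changes by c @ x + d, which the softmax cancels.
c = rng.normal(size=D)
d = rng.normal()
W2_shifted = W2 + c             # broadcasts c onto each row
b2_shifted = b2 + d

print(softmax(W2 @ x + b2))
print(softmax(W2_shifted @ x + b2_shifted))   # identical probabilities
```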
  2. When we use softmax and go from 2-class classification to 3-class classification, we increase the number of parameters from $D+1$ to $3D+3$. However, intuitively it seems that if we can do 2-class classification with $D+1$ parameters, we should be able to do 3-class with $2D+2$. Is this intuition correct, and if so, is this normally done in practice?

Your intuition is correct. You can make do with $2D+2$ parameters by fixing one of the softmax arguments at zero; the reason this works is the sum-to-one constraint. Doing this yields the sigmoid function (the logit link) as a special case.
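
Here is a small NumPy sketch of that special case (illustrative names only): with the second logit pinned at zero, the first softmax output is exactly $\sigma(\mathbf{w}^\top\mathbf{x} + b)$.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4
x = rng.normal(size=D)
w = rng.normal(size=D)
b = rng.normal()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Two-class softmax with the second logit fixed at zero: only D + 1 free parameters
logits = np.array([w @ x + b, 0.0])
p = softmax(logits)

print(p[0])                   # probability of the first class
print(sigmoid(w @ x + b))     # identical: the sigmoid is the constrained special case
```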

I am not too familiar with what is done in practice, but it does seem that some use the overparameterized $3D+3$ specification for reasons that are not clear to me, so I cannot say what is normally done in the field.