Assume I have a data point $\mathbf{x} = [x_1, x_2, \ldots, x_D]^\top$ that I want to classify into one of two mutually exclusive categories $\mathcal{C}_0$ and $\mathcal{C}_1$. I can create a simple neural network with $D+1$ parameters and train it with a sigmoid cross-entropy loss: $$ \hat{y}_1 = \sigma(\mathbf{w}_1^\top\mathbf{x} + b_1) \tag{1}\label{1} $$ where $\mathbf{w}_1 \in \mathbb{R}^D$, $b_1 \in \mathbb{R}$, and $\sigma(z) = 1/(1 + \exp(-z))$. The label here would be a scalar $0$ or $1$.
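For concreteness, here is a minimal sketch of the form in equation $\eqref{1}$ (assuming PyTorch; the feature dimension and data below are just placeholders):

```python
import torch
import torch.nn as nn

D = 4                                    # placeholder feature dimension

# Equation (1): one linear unit -> D weights + 1 bias = D+1 parameters.
model_1 = nn.Linear(D, 1)

# BCEWithLogitsLoss applies the sigmoid internally, so we pass raw logits.
loss_fn_1 = nn.BCEWithLogitsLoss()

x = torch.randn(8, D)                    # a batch of 8 data points
y = torch.randint(0, 2, (8, 1)).float()  # scalar 0/1 labels

loss = loss_fn_1(model_1(x), y)
loss.backward()
```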
Or I could create a network with $2D+2$ parameters and train it with a softmax cross-entropy loss: $$ \mathbf{\hat{y}}_2 = \mbox{softmax}(\mathbf{W}_2\mathbf{x} + \mathbf{b}_2) \tag{2}\label{2} $$ where $\mathbf{W}_2 \in \mathbb{R}^{2 \times D}$ and $\mathbf{b}_2 \in \mathbb{R}^2$. Here, the two components of $\mathbf{\hat{y}}_2$ sum to $1$ because of the softmax. The label here would be either the one-hot vector $[1, 0]^\top$ or $[0, 1]^\top$.
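And the corresponding sketch of the form in equation $\eqref{2}$ (again assuming PyTorch, with placeholder data; note that `CrossEntropyLoss` takes integer class indices, which is equivalent to one-hot targets under softmax cross entropy):

```python
import torch
import torch.nn as nn

D = 4                              # placeholder feature dimension

# Equation (2): 2 output units -> a 2xD weight matrix + 2 biases = 2D+2 parameters.
model_2 = nn.Linear(D, 2)

# CrossEntropyLoss applies log-softmax internally.
loss_fn_2 = nn.CrossEntropyLoss()

x = torch.randn(8, D)
y = torch.randint(0, 2, (8,))      # class indices 0 or 1

loss = loss_fn_2(model_2(x), y)
loss.backward()
```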
Questions:
- Does it ever make sense to use the form in equation $\eqref{2}$ over equation $\eqref{1}$ given that it has twice the number of parameters?
- Going from 2-class classification with the sigmoid form $\eqref{1}$ to 3-class classification with softmax increases the number of parameters from $D+1$ to $3D+3$. However, intuitively it seems that if 2-class classification can be done with $D+1$ parameters, then 3-class classification should be possible with $2D+2$, i.e. $(K-1)(D+1)$ parameters for $K$ classes. Is this intuition correct, and if so, is this normally done in practice? (A sketch of what I have in mind is below.)
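If that intuition holds, one way I could imagine implementing the reduced parameterization is to learn logits for only $K-1$ classes and pin the last class's logit at $0$ (a hedged sketch, assuming PyTorch; `D`, `K`, and the data are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, K = 4, 3                     # placeholder feature dimension and class count

# (K-1)*(D+1) parameters: learn logits for K-1 classes, fix the last logit at 0.
reduced = nn.Linear(D, K - 1)

x = torch.randn(8, D)
y = torch.randint(0, K, (8,))   # class indices 0..K-1

partial = reduced(x)                          # shape (8, K-1)
zeros = torch.zeros(partial.shape[0], 1)      # fixed reference logit for the last class
logits = torch.cat([partial, zeros], dim=1)   # shape (8, K)

loss = F.cross_entropy(logits, y)
loss.backward()
```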