I am trying to perform multi-label classification using a neural network with 3 hidden ReLU layers and a softmax output layer. I'm also using mini-batch SGD as the optimization algorithm and negative log likelihood as the loss function, with L2 regularization (a minimal sketch of this setup is included after the list below). After performing feature scaling, this is what I'm noticing:
- Validation and test error plateau very early during training
- Virtually no hyperparameter other than the mini-batch size has any effect on performance, even when I choose seemingly absurd values (e.g., learning rate, number of hidden units, L2 regularization coefficient)
- I'm getting better performance on the non-scaled version of the data set than on the scaled one.
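For reference, here is a minimal sketch of the kind of setup I described, written in PyTorch. This is not my actual code; the dimensions, data, and labels are placeholders, and my real pipeline differs in the details.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dimensions -- not my real data set.
n_features, n_hidden, n_classes = 100, 64, 10

# Three hidden ReLU layers followed by a softmax output
# (LogSoftmax here, so it pairs with NLLLoss below).
model = nn.Sequential(
    nn.Linear(n_features, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_hidden), nn.ReLU(),
    nn.Linear(n_hidden, n_classes), nn.LogSoftmax(dim=1),
)

# Negative log likelihood loss; weight_decay supplies the L2 penalty.
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Placeholder data, then feature scaling (standardize each column).
X = torch.randn(1000, n_features)
y = torch.randint(0, n_classes, (1000,))
X = (X - X.mean(dim=0)) / (X.std(dim=0) + 1e-8)

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# Mini-batch SGD training loop.
for epoch in range(20):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```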
The second point bothers me the most, as I cannot explain it at all. Surely the learning rate or the number of units in the hidden layers should affect performance at least slightly, but that's not what I'm seeing. If anyone has any pointers for explaining or even rectifying the situation, I would be really grateful.
Thanks in advance.