I have a softmax layer, i.e., a weight matrix (with no hidden layers before it) applied directly to the input, whose output is then passed through softmax. I'm wondering whether this alone, trained with gradient descent on the cross-entropy error, can learn arbitrary classification problems, or whether an MLP hidden layer between the input and the softmax layer is essentially a requirement. When I try it on the pima-indians-diabetes dataset, it converges to predicting only one of the two classes (the majority class 0, about 65% of the data).
Also, my implementation is unable to learn the iris dataset.
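For reference, a softmax layer with no hidden layers is just multinomial logistic regression, so it can only learn linearly separable structure. Here's a minimal NumPy sketch of that setup (toy data and made-up hyperparameters, not your actual datasets) that does learn a separable two-class problem; if something like this fails on your data, the bug is likely in the gradient or the targets rather than the model class:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-feature, 2-class, linearly separable data
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
Y = np.eye(2)[y]  # one-hot targets

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

W = np.zeros((2, 2))  # (n_features, n_classes) -- no hidden layer
b = np.zeros(2)
lr = 0.1

for _ in range(500):
    P = softmax(X @ W + b)
    # gradient of mean cross-entropy w.r.t. the logits is (P - Y) / N
    G = (P - Y) / len(X)
    W -= lr * X.T @ G
    b -= lr * G.sum(axis=0)

acc = (softmax(X @ W + b).argmax(axis=1) == y).mean()
print(acc)
```

On well-separated toy data like this, accuracy should end up near 1.0; on the diabetes data, anything stuck exactly at the majority-class rate usually means the gradients aren't flowing at all.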
I'd also like to learn how to gradient-check the cross-entropy gradient with a numerical approximation.
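The standard recipe for gradient checking is central differences: perturb each weight by a small epsilon in both directions, take the difference of the losses over 2*epsilon, and compare against your analytic gradient. A sketch for a single example through a bare softmax layer (where the analytic gradient of the cross-entropy w.r.t. W is the outer product of (p - onehot(y)) with x):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def loss(W, x, y_idx):
    # cross-entropy of one example for logits W @ x, true class y_idx
    return -np.log(softmax(W @ x)[y_idx])

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
y_idx = 2

# analytic gradient: dL/dW = (p - onehot(y)) outer x
p = softmax(W @ x)
p[y_idx] -= 1.0
analytic = np.outer(p, x)

# numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (loss(Wp, x, y_idx) - loss(Wm, x, y_idx)) / (2 * eps)

rel_err = np.abs(analytic - numeric).max() / (np.abs(analytic).max() + 1e-12)
print(rel_err)
```

If the relative error isn't tiny (roughly 1e-6 or smaller), the analytic gradient is wrong; the same loop works for any other weight matrix by swapping which parameter you perturb.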
This is more information than necessary, but the softmax layer is connected to the output of a reservoir (reservoir computing). I'm also having trouble working out the gradient of the cross-entropy with respect to the hidden layer's input weight matrix.
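In standard reservoir computing the input/reservoir weights are usually left fixed and only the readout is trained, but if you do want that gradient, it's just one more chain-rule step. Assuming a single tanh hidden layer (a simplification of your reservoir: h = tanh(W_in @ x), logits = W_out @ h), a sketch of the backward pass:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 4, 5, 3

W_in = rng.normal(size=(n_hid, n_in))    # hidden layer's input weight matrix
W_out = rng.normal(size=(n_out, n_hid))  # softmax readout weights
x = rng.normal(size=n_in)
y = np.eye(n_out)[1]                     # one-hot target, class 1 as an example

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# forward
h = np.tanh(W_in @ x)
p = softmax(W_out @ h)

# backward (chain rule)
d_logits = p - y               # dL/d(logits) for softmax + cross-entropy
d_h = W_out.T @ d_logits       # back through the readout
d_pre = d_h * (1 - h ** 2)     # tanh'(a) = 1 - tanh(a)^2
dW_in = np.outer(d_pre, x)     # dL/dW_in -- the gradient you're after
dW_out = np.outer(d_logits, h) # dL/dW_out, for completeness
```

Replace the tanh derivative with whatever nonlinearity your reservoir actually uses, and note that a recurrent reservoir would additionally need backpropagation through time rather than this single-step version.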
Believe me, I've searched around the internet a lot and can't find an answer to this particular set of questions. Any and all help is appreciated!