I wrote a program to classify MNIST with a vanilla neural net using sigmoid activations, trained with backpropagation. I tried to work through the math myself (because I want to understand things), and the formula I ended up with was
$$\frac{\partial E}{\partial W_{ab}} = \begin{cases} 2\,O_a\,O_b\,(1-O_b)\,(O_b - \mathrm{exp}_b), & \text{if node } b \text{ is an output node} \\ O_a\,O_b\,(1-O_b)\,\sum_L \frac{\partial E}{\partial O_L}\,\frac{\partial O_L}{\partial x_L}\,W_{bL}, & \text{if node } b \text{ is a hidden node} \end{cases}$$
Where $L$ ranges over the next layer, $\sum_L$ is the sum across that layer, and $\mathrm{exp}_b$ is the expected output for node $b$. This looked similar to what I saw elsewhere, so I assumed it was correct. After implementing the neural net and trying to train it on MNIST, I found that it wasn't working at all (effectively random results). To narrow things down, I unit-tested pieces individually, and found that if I adjust only the weights in the final layer, I get a successful classification rate of 88% after just one epoch.
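For the output-node case, these are the chain-rule steps I followed (with $E$ as the summed squared error over the output layer):

$$E = \sum_b (O_b - \mathrm{exp}_b)^2, \qquad O_b = \sigma(x_b), \qquad x_b = \sum_a O_a W_{ab},$$

$$\frac{\partial E}{\partial W_{ab}} = \frac{\partial E}{\partial O_b}\cdot\frac{\partial O_b}{\partial x_b}\cdot\frac{\partial x_b}{\partial W_{ab}} = 2(O_b - \mathrm{exp}_b)\cdot O_b(1-O_b)\cdot O_a.$$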
So clearly something is wrong with how I calculate the weight adjustments for the non-output layers. The only explanations I could think of are that my formula is wrong, or that, since the expected outputs are vectors of nine 0's and a single 1, the algorithm is minimizing error by pushing everything toward zero and simply ignoring the 1 (although I don't think either of these is very likely).
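To get a feel for the second hypothesis, here is a tiny self-contained sketch (the 0.1 starting activation is just an assumption for illustration, not from my program) comparing the error-term magnitude at the output node whose target is 1 against one whose target is 0:

```java
public class OneHotCheck {
    public static void main(String[] args) {
        double o = 0.1; // suppose every output currently sits near 0.1
        // error term 2*(O - exp)*O*(1 - O) from the output-node formula
        double gTarget = 2 * (o - 1.0) * o * (1 - o); // node whose target is 1
        double gOther  = 2 * (o - 0.0) * o * (1 - o); // one of the nine 0-target nodes
        System.out.println("toward 1: " + gTarget + ", toward 0: " + gOther);
    }
}
```

Per node, the pull toward the 1 comes out about nine times stronger in magnitude than the pull toward 0 here, so I would expect the 1 not to be simply ignored.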
Here is the Java code for the training algorithm. I think the variable names make enough sense that this segment is understandable without the rest of the program, but if you need to see anything else, just ask.
for(int ii = 0; ii < outputLayer.size(); ii++)
{
    Node n = outputLayer.get(ii);
    for(Connection c : n.connections)
    {
        c.origin.adjustSum += c.destination.value * (1 - c.destination.value) * (c.destination.value - expected[ii]) * c.weight;
        c.weight -= learningRate * c.origin.value * c.destination.value * (1 - c.destination.value) * (c.destination.value - expected[ii]);
    }
}
for(Node n : hiddenLayer)
{
    for(Connection c : n.connections)
    {
        c.weight -= learningRate * c.origin.value * c.destination.value * (1 - c.destination.value) * n.adjustSum;
    }
}
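One self-contained way I could test the hidden-node formula, independent of the rest of my program (all the names below are invented for this sketch, not my real classes), would be a fixed 2-2-2 sigmoid net where the analytic gradient is compared against a central finite difference:

```java
public class GradCheck {
    static double sig(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // forward pass of a 2-2-2 sigmoid net; returns {h0, h1, o0, o1}
    static double[] forward(double[][] w1, double[][] w2, double[] in) {
        double[] v = new double[4];
        for (int j = 0; j < 2; j++) {
            double s = 0;
            for (int i = 0; i < 2; i++) s += in[i] * w1[i][j];
            v[j] = sig(s);
        }
        for (int k = 0; k < 2; k++) {
            double s = 0;
            for (int j = 0; j < 2; j++) s += v[j] * w2[j][k];
            v[2 + k] = sig(s);
        }
        return v;
    }

    // squared error E = sum_k (o_k - t_k)^2
    static double error(double[][] w1, double[][] w2, double[] in, double[] t) {
        double[] v = forward(w1, w2, in);
        double e = 0;
        for (int k = 0; k < 2; k++) e += (v[2 + k] - t[k]) * (v[2 + k] - t[k]);
        return e;
    }

    public static void main(String[] args) {
        double[][] w1 = {{0.1, -0.2}, {0.4, 0.3}};
        double[][] w2 = {{-0.5, 0.2}, {0.1, 0.6}};
        double[] in = {0.7, 0.3}, t = {1.0, 0.0};

        // analytic gradient for the hidden-layer weight w1[0][1],
        // following the hidden-node case of the formula above
        double[] v = forward(w1, w2, in);
        int i = 0, j = 1;
        double sum = 0;
        for (int k = 0; k < 2; k++)
            sum += 2 * (v[2 + k] - t[k]) * v[2 + k] * (1 - v[2 + k]) * w2[j][k];
        double analytic = in[i] * v[j] * (1 - v[j]) * sum;

        // numeric gradient by central finite difference
        double eps = 1e-6;
        w1[i][j] += eps; double ep = error(w1, w2, in, t);
        w1[i][j] -= 2 * eps; double em = error(w1, w2, in, t);
        w1[i][j] += eps;
        double numeric = (ep - em) / (2 * eps);

        if (Math.abs(analytic - numeric) > 1e-6)
            throw new AssertionError("gradients disagree: " + analytic + " vs " + numeric);
        System.out.println("gradient check passed");
    }
}
```

If the two values disagree, the formula (or my translation of it into code) is off; if they agree, the bug is more likely somewhere else, e.g. in how `adjustSum` is accumulated or reset.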
I am a high-school student new to Stack Exchange (and computer science), so if I have done something wrong with this question, just let me know in the comments and I'll try to fix it.