I am trying to implement a fully vectorized neural net following this Stanford example. I am using C# with the Math.NET Numerics library, whose binaries can be downloaded from NuGet. Because the Stanford example is not complete (bias backpropagation and gradient descent are missing), I had to define these myself from the theory linked in the text. As a test I use a simple XOR function that creates input-output sets. In the end I have code that compiles and runs, but it doesn't converge to an answer: it stays at 0.5 for all 4 input-output sets, so something is wrong.
I am new to Cross Validated. Normally I would post on StackOverflow with a C# tag, but my question is not really a C# question; it's more about understanding how to implement backpropagation and gradient descent in a vectorized manner.
I am a bit reluctant to dump the whole source code, even though it's a small program (160 lines), so I'll start with the main body. If someone wants to try it I can provide the rest too.
Notations are as per the Stanford example, except that I've used capitals for matrices. The input X is a 2 x mSets matrix, where mSets is the number of XOR input sets (4). W1, W2, b1 and b2 are randomly initialized between 0 and e, where e = 0.01. I have also tried stacking the biases on top of the inputs to the hidden and output layers, but this creates dimensionality problems in the backpropagation; in other words, I don't know how to do it.
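Concretely, the XOR training data looks like this (each column is one input-output set); this is just a minimal sketch with Math.NET's DenseMatrix to show the shapes:

using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics.LinearAlgebra.Double;

// each column is one of the mSets = 4 XOR cases
Matrix<double> X = DenseMatrix.OfArray(new double[,]
{
    { 0, 0, 1, 1 },
    { 0, 1, 0, 1 }
});                                   // 2 x mSets

Matrix<double> Y = DenseMatrix.OfArray(new double[,]
{
    { 0, 1, 1, 0 }
});                                   // 1 x mSets (the XOR targets)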
My guess is I have the gradient descent wrong. I am hoping that anyone who has been here before will easily spot my mistakes.
UPDATE: I have solved the problem. There were two main issues, one of them fundamental: 1) the XOR problem cannot be solved by a 2x2x1 network without bias, and 2) the bias has to be implemented correctly. The problem is that introducing the bias increases the dimensions, yet during backpropagation you need to reduce this dimension again. This is shown in the updated code below: remCol() removes the last column from a matrix.
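remCol() itself is nothing more than a thin wrapper; with Math.NET it can be written like this (RemoveColumn returns a new matrix without the given column):

using MathNet.Numerics.LinearAlgebra;

// remove the column with index col (here: the bias column, which is the last one)
static Matrix<double> remCol(Matrix<double> M, int col)
{
    return M.RemoveColumn(col);
}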
There are 2 sources that helped me: Stephen Welch has a nice series of videos on YouTube which is very similar to the Stanford code but does not address bias; Ryan Harris, however, does. His code is in Python, but in chapter 3 (YouTube) he mentions that it is important not to pass the bias back during backpropagation. That was the trick.
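For context, the loop below assumes roughly this setup (a minimal sketch; the hyperparameter values are only examples, and X and Y are the XOR matrices from the sketch above):

int nHid = 2;           // number of hidden units (2x2x1 network)
int maxIter = 100000;   // iterations (example value)
double alpha = 1.0;     // learning rate (example value)
double lambda = 0.0;    // regularization strength (example value)
double e = 0.01;        // scale of the random initialization
var rnd = new System.Random();

// the weights get an extra column for the bias, because a row of 1's
// is stacked under the layer inputs in the forward pass
Matrix<double> W1 = Matrix<double>.Build.Dense(nHid, X.RowCount + 1, (r, c) => rnd.NextDouble() * e);
Matrix<double> W2 = Matrix<double>.Build.Dense(1, nHid + 1, (r, c) => rnd.NextDouble() * e);

// the "bias input": a single row of 1's, one column per training set
Matrix<double> b1 = Matrix<double>.Build.Dense(1, X.ColumnCount, (r, c) => 1.0);

Matrix<double> Z2, A2, Z3, H, delta3, delta2, gradW1, gradW2;
double J;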
// run the network
for (int i = 0; i < maxIter; i++)
{
    // forward propagation
    Z2 = W1 * X.Stack(b1);                  // stack a row of 1's at the bottom of X for the bias
    A2 = Z2.Map(SpecialFunctions.Logistic); // hidden layer activations
    Z3 = W2 * A2.Stack(b1);                 // stack the bias row onto the hidden activations as well
    H = Z3.Map(SpecialFunctions.Logistic);  // network output

    // backpropagation
    J = 0.5 * sum(power(Y - H, 2)) / X.RowCount + lambda / 2.0 * (sum(power(W1, 2)) + sum(power(W2, 2))); // cost function with regularization
    delta3 = pointWise(H - Y, Z3.Map(inverseLogistic)); // delta of the output layer
    gradW2 = delta3 * transpose(A2.Stack(b1)) / X.RowCount + lambda * W2; // grad = partial derivatives, divide by X.RowCount for scaling and add the regularization term
    delta2 = pointWise(transpose(remCol(W2, nHid)) * delta3, Z2.Map(inverseLogistic)); // delta of the hidden layer; remove the last (bias) column of W2 so the bias is not propagated back
    gradW1 = delta2 * transpose(X.Stack(b1)) / X.RowCount + lambda * W1;

    // gradient descent
    W1 -= alpha * gradW1;
    W2 -= alpha * gradW2;

    if (i % 100 == 0)
        Console.WriteLine("Error J " + J.ToString("E4"));
}
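For anyone who wants to try the loop above: it relies on a few small helpers that I didn't post. They are thin wrappers around Math.NET, roughly like the sketch below (remCol is shown further up). The one thing that matters for correctness is that inverseLogistic, despite its name, has to evaluate the derivative of the logistic function, σ'(z) = σ(z)(1 - σ(z)), which is what the delta terms need.

using System;
using System.Linq;
using MathNet.Numerics;
using MathNet.Numerics.LinearAlgebra;

static class NetHelpers
{
    // sum of all matrix entries
    public static double sum(Matrix<double> M) => M.Enumerate().Sum();

    // element-wise power
    public static Matrix<double> power(Matrix<double> M, double p) => M.Map(v => Math.Pow(v, p));

    // element-wise (Hadamard) product
    public static Matrix<double> pointWise(Matrix<double> A, Matrix<double> B) => A.PointwiseMultiply(B);

    public static Matrix<double> transpose(Matrix<double> M) => M.Transpose();

    // derivative of the logistic (sigmoid) function
    public static double inverseLogistic(double z)
    {
        double s = SpecialFunctions.Logistic(z);
        return s * (1.0 - s);
    }
}

Put these in the same class as the training loop, or import them with "using static NetHelpers;", so they can be called without a prefix as in the loop above.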