I am trying to implement a fully vectorized neural net following this Stanford example. I am using C# with the Math.NET Numerics library, whose binaries can be downloaded from NuGet. Because the Stanford example is not complete (bias backpropagation and gradient descent are missing), I had to define these myself from the theory linked in the text. As a test I use a simple XOR function that creates input-output sets. In the end I have code that compiles and runs, but it doesn't converge to an answer: it stays at 0.5 for all 4 input-output sets, so something is wrong.
I am new to Cross Validated. Normally I would post on StackOverflow with a C# tag, but my question is not really a C# question; it's more about understanding how to implement backpropagation and gradient descent in a vectorized manner.
I am a bit reluctant to dump the whole source code, even though it's a small program (160 lines), so I'll start with the main body. If someone wants to try it I can provide the rest too.
Notations are as per the Stanford example, except that I've used capitals for matrices. The input X is a 2 x mSets matrix, where mSets is the number of XOR input sets (4). W1, W2, b1 and b2 are randomly initialized between 0 and e, where e = 0.01. I have also tried stacking the biases on top of the inputs to the hidden and output layers, but this creates dimensionality problems in the backpropagation; in other words, I don't know how to do it.
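Concretely, the XOR training data looks like this (each column is one input-output set); this is just a minimal sketch with Math.NET's DenseMatrix to show the shapes:

using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics.LinearAlgebra.Double;

// each column is one of the mSets = 4 XOR cases
Matrix<double> X = DenseMatrix.OfArray(new double[,]
{
    { 0, 0, 1, 1 },
    { 0, 1, 0, 1 }
});                                   // 2 x mSets

Matrix<double> Y = DenseMatrix.OfArray(new double[,]
{
    { 0, 1, 1, 0 }
});                                   // 1 x mSets (the XOR targets)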
My guess is I have the gradient descent wrong. I am hoping that anyone who has been here before will easily spot my mistakes.
UPDATE: I have solved the problem. There were two main issues, one of them fundamental: 1) the XOR problem cannot be solved by a 2x2x1 network without bias, and 2) the bias has to be implemented correctly. The problem is that introducing the bias increases the dimensions, yet during backpropagation you need to reduce this dimension again. This is shown in the updated code below: remCol() removes the last column from a matrix.
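remCol() itself is nothing more than a thin wrapper; with Math.NET it can be written like this (RemoveColumn returns a new matrix without the given column):

using MathNet.Numerics.LinearAlgebra;

// remove the column with index col (here: the bias column, which is the last one)
static Matrix<double> remCol(Matrix<double> M, int col)
{
    return M.RemoveColumn(col);
}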
There are 2 sources that helped me: Stephen Welch has a nice series of videos on YouTube which is very similar to the Stanford code but does not address bias; Ryan Harris, however, does. His code is in Python, but in chapter 3 (YouTube) he mentions that it is important not to pass the bias back during backpropagation. That was the trick.
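For context, the loop below assumes roughly this setup (a minimal sketch; the hyperparameter values are only examples, and X and Y are the XOR matrices from the sketch above):

int nHid = 2;           // number of hidden units (2x2x1 network)
int maxIter = 100000;   // iterations (example value)
double alpha = 1.0;     // learning rate (example value)
double lambda = 0.0;    // regularization strength (example value)
double e = 0.01;        // scale of the random initialization
var rnd = new System.Random();

// the weights get an extra column for the bias, because a row of 1's
// is stacked under the layer inputs in the forward pass
Matrix<double> W1 = Matrix<double>.Build.Dense(nHid, X.RowCount + 1, (r, c) => rnd.NextDouble() * e);
Matrix<double> W2 = Matrix<double>.Build.Dense(1, nHid + 1, (r, c) => rnd.NextDouble() * e);

// the "bias input": a single row of 1's, one column per training set
Matrix<double> b1 = Matrix<double>.Build.Dense(1, X.ColumnCount, (r, c) => 1.0);

Matrix<double> Z2, A2, Z3, H, delta3, delta2, gradW1, gradW2;
double J;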
// run the network
for (int i = 0; i < maxIter; i++)
{
    // forward propagation
    Z2 = W1 * X.Stack(b1);                  // stack a row of 1's at the bottom of X for the bias
    A2 = Z2.Map(SpecialFunctions.Logistic); // hidden layer activations
    Z3 = W2 * A2.Stack(b1);                 // stack the bias row onto the hidden activations as well
    H = Z3.Map(SpecialFunctions.Logistic);  // network output

    // backpropagation
    J = 0.5 * sum(power(Y - H, 2)) / X.RowCount + lambda / 2.0 * (sum(power(W1, 2)) + sum(power(W2, 2))); // cost function with regularization
    delta3 = pointWise(H - Y, Z3.Map(inverseLogistic)); // delta of the output layer
    gradW2 = delta3 * transpose(A2.Stack(b1)) / X.RowCount + lambda * W2; // grad = partial derivatives, divide by X.RowCount for scaling and add the regularization term
    delta2 = pointWise(transpose(remCol(W2, nHid)) * delta3, Z2.Map(inverseLogistic)); // delta of the hidden layer; remove the last (bias) column of W2 so the bias is not propagated back
    gradW1 = delta2 * transpose(X.Stack(b1)) / X.RowCount + lambda * W1;

    // gradient descent
    W1 -= alpha * gradW1;
    W2 -= alpha * gradW2;

    if (i % 100 == 0)
        Console.WriteLine("Error J " + J.ToString("E4"));
}
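For anyone who wants to try the loop above: it relies on a few small helpers that I didn't post. They are thin wrappers around Math.NET, roughly like the sketch below (remCol is shown further up). The one thing that matters for correctness is that inverseLogistic, despite its name, has to evaluate the derivative of the logistic function, σ'(z) = σ(z)(1 - σ(z)), which is what the delta terms need.

using System;
using System.Linq;
using MathNet.Numerics;
using MathNet.Numerics.LinearAlgebra;

static class NetHelpers
{
    // sum of all matrix entries
    public static double sum(Matrix<double> M) => M.Enumerate().Sum();

    // element-wise power
    public static Matrix<double> power(Matrix<double> M, double p) => M.Map(v => Math.Pow(v, p));

    // element-wise (Hadamard) product
    public static Matrix<double> pointWise(Matrix<double> A, Matrix<double> B) => A.PointwiseMultiply(B);

    public static Matrix<double> transpose(Matrix<double> M) => M.Transpose();

    // derivative of the logistic (sigmoid) function
    public static double inverseLogistic(double z)
    {
        double s = SpecialFunctions.Logistic(z);
        return s * (1.0 - s);
    }
}

Put these in the same class as the training loop, or import them with "using static NetHelpers;", so they can be called without a prefix as in the loop above.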