I have implemented NAG (Nesterov accelerated gradient) following this tutorial:
http://cs231n.github.io/neural-networks-3/#ada
It works; in fact, with mu = 0.95 I get a good speed-up in learning compared to standard gradient descent. However, I am not sure I implemented it correctly, and my doubt concerns the gradient estimate I am using. This is the MATLAB code:
% backup previous velocity
vb_prev = nn.vb{l};
vW_prev = nn.vW{l};
% update velocity with adapted learning rates
nn.vb{l} = nn.mu*vb_prev - lrb.*nn.db{l};
nn.vW{l} = nn.mu*vW_prev - lrW.*nn.dW{l};
% update weights and biases
nn.b{l} = nn.b{l} - nn.mu*vb_prev + (1 + nn.mu)*nn.vb{l};
nn.W{l} = nn.W{l} - nn.mu*vW_prev + (1 + nn.mu)*nn.vW{l};
This is exactly what is written in the tutorial. Here db and dW are my gradient estimates, but I am not sure they are the right ones, since I evaluate them at W rather than at the look-ahead point W_ahead = W + mu*v.
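To make that concrete, this is roughly what I imagine evaluating the gradient at the look-ahead point would require. It is only a sketch: the forward/back-propagation step in the middle is a placeholder for whatever pass actually computes nn.dW{l} and nn.db{l}, and W_backup/b_backup are temporary variables I introduce just for illustration.
% sketch: shift the parameters to the look-ahead point before the pass,
% then restore them so the velocity/position update above stays unchanged
W_backup = nn.W{l};
b_backup = nn.b{l};
nn.W{l} = nn.W{l} + nn.mu*nn.vW{l};   % W_ahead = W + mu*vW
nn.b{l} = nn.b{l} + nn.mu*nn.vb{l};   % b_ahead = b + mu*vb
% ... forward and back-propagation here would give dW, db at the look-ahead point ...
nn.W{l} = W_backup;                   % restore the original parameters
nn.b{l} = b_backup;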
Should I change something like this before forward and back-propagation to get a different gradient estimate, or is NAG already correct as I have it?
It works, but I would like to have a correct implementation of NAG.
Any help is appreciated, thank you very much!