I have implemented NAG (Nesterov accelerated gradient) following this tutorial:
http://cs231n.github.io/neural-networks-3/#ada
It works; in fact, with mu = 0.95 I get a good speed-up in learning compared to standard gradient descent. However, I am not sure I implemented it correctly, and my doubt concerns the gradient estimate I am using. This is the MATLAB code:
% backup previous velocity
vb_prev = nn.vb{l};
vW_prev = nn.vW{l};
% update velocity with adapted learning rates
nn.vb{l} = nn.mu*vb_prev - lrb.*nn.db{l};
nn.vW{l} = nn.mu*vW_prev - lrW.*nn.dW{l};
% update weights and biases
nn.b{l} = nn.b{l} - nn.mu*vb_prev + (1 + nn.mu)*nn.vb{l};
nn.W{l} = nn.W{l} - nn.mu*vW_prev + (1 + nn.mu)*nn.vW{l};
This is exactly what is written in the tutorial. Here db and dW are my gradient estimates, but I am not sure they are the right ones, since I evaluate them at W rather than at the look-ahead point W_ahead = W + mu*v.
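To make that concrete, this is roughly what I imagine evaluating the gradient at the look-ahead point would require. It is only a sketch: the forward/back-propagation step in the middle is a placeholder for whatever pass actually computes nn.dW{l} and nn.db{l}, and W_backup/b_backup are temporary variables I introduce just for illustration.
% sketch: shift the parameters to the look-ahead point before the pass,
% then restore them so the velocity/position update above stays unchanged
W_backup = nn.W{l};
b_backup = nn.b{l};
nn.W{l} = nn.W{l} + nn.mu*nn.vW{l};   % W_ahead = W + mu*vW
nn.b{l} = nn.b{l} + nn.mu*nn.vb{l};   % b_ahead = b + mu*vb
% ... forward and back-propagation here would give dW, db at the look-ahead point ...
nn.W{l} = W_backup;                   % restore the original parameters
nn.b{l} = b_backup;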
Should I change something like this before forward and back-propagation to get a different gradient estimate, or is NAG already correct as I have it?
It works, but I would like to have a correct implementation of NAG.
Any help is appreciated, thank you very much!