This might be a silly question, but here it is anyway. I'm trying to implement Nesterov's momentum to extend the gradient descent algorithm I'm currently using for my neural network, which already uses classical momentum.
Now, I know that applying Nesterov's momentum simply amounts to evaluating the gradient at a shifted point, W_shifted = W_current + α * ΔW_old (where W_current are the current weights, ΔW_old is the weight update from the last iteration, and α is the momentum coefficient), and then carrying on with the usual steps of gradient descent.
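
For concreteness, here is a minimal sketch of the update rule I have in mind (the names `nesterov_step`, `grad_fn`, `lr`, and `alpha` are just placeholders for my own gradient function, learning rate, and momentum coefficient, not any particular library's API):

```python
import numpy as np

def nesterov_step(W, dW_old, grad_fn, lr=0.01, alpha=0.9):
    """One Nesterov-momentum update (placeholder names, not a library API)."""
    # Look-ahead point: the gradient is evaluated at the shifted weights.
    W_shifted = W + alpha * dW_old
    grad = grad_fn(W_shifted)
    # Usual momentum-style step, but driven by the look-ahead gradient.
    dW_new = alpha * dW_old - lr * grad
    return W + dW_new, dW_new

# Toy check on the quadratic loss L(W) = 0.5 * ||W||^2, whose gradient is W.
W = np.array([1.0, -2.0])
dW = np.zeros_like(W)
for _ in range(100):
    W, dW = nesterov_step(W, dW, grad_fn=lambda w: w)
print(W)  # should end up close to the minimum at [0, 0]
```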
The question is: when evaluating the loss function at each iteration, should I compute the network's output at W_shifted or at W_current?