I implemented these two methods in a deep learning project using Theano. I understand the mathematical difference between them, and my conceptual understanding is that Nesterov is an improvement over momentum.
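For concreteness, the two update rules can be sketched in plain Python as follows. This is only a minimal illustration, not the Theano code from the project: the function names, the toy objective f(w) = 0.5 * w**2, and the values lr=0.1 and mu=0.9 are all assumptions chosen for the example.

```python
def momentum_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Classical momentum: the gradient is evaluated at the current point w."""
    v_new = mu * v - lr * grad_fn(w)
    return w + v_new, v_new


def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Nesterov momentum: the gradient is evaluated at the look-ahead point w + mu * v."""
    v_new = mu * v - lr * grad_fn(w + mu * v)
    return w + v_new, v_new


def run(step, w0=1.0, n_steps=50, lr=0.1, mu=0.9):
    """Apply `step` repeatedly to the toy objective f(w) = 0.5 * w**2 (gradient is w)."""
    grad_fn = lambda w: w  # hypothetical toy gradient, not a real network
    w, v = w0, 0.0
    for _ in range(n_steps):
        w, v = step(w, v, grad_fn, lr=lr, mu=mu)
    return w


if __name__ == "__main__":
    print("momentum:", run(momentum_step))
    print("nesterov:", run(nesterov_step))
```

The only difference between the two functions is where the gradient is evaluated, so with zero initial velocity the first step is identical and the trajectories diverge from the second step onward.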
My question is: are there practical situations where momentum descent would be preferred over Nesterov? In my experience, Nesterov has always performed better. In what situation would I use plain momentum instead?