
Problem: I'm unsure if I understood Nesterov Optimization

I'm writing about Nesterov optimization, but the notation I'm using seems different from the references below. I worked it out using some books as guides.

Would someone please clarify?


Let $\epsilon$ be the learning rate, $w$ the weights of the neural network, $\alpha$ the momentum, and $E$ a loss function. Considering that the weights and gradients are calculated as a one-dimensional vector, the weight update is done as below:

$n_0 = 0$

$n_t = \alpha\, n_{t-1} + \epsilon \frac{\partial E}{\partial w_t}$

And the update for each weight is done as in the formula below:

$\Delta w_t = \alpha\, n_{t-1} - (1 - \alpha)\, n_t$
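
To make my interpretation concrete, here is a minimal sketch in Python of how I am reading these formulas. It assumes the weights are then updated as $w_{t+1} = w_t + \Delta w_t$, and `loss_grad` is just a placeholder name for whatever computes $\frac{\partial E}{\partial w}$:

```python
import numpy as np

def momentum_step(w, n_prev, loss_grad, epsilon=0.01, alpha=0.9):
    """One step of n_t = alpha * n_{t-1} + epsilon * dE/dw_t,
    followed by delta_w_t = alpha * n_{t-1} - (1 - alpha) * n_t."""
    n_t = alpha * n_prev + epsilon * loss_grad(w)   # accumulate the velocity
    delta_w = alpha * n_prev - (1 - alpha) * n_t    # the weight update above
    return w + delta_w, n_t

# n_0 = 0, with the same shape as the weight vector
w = np.array([0.5, -1.0])
n = np.zeros_like(w)
w, n = momentum_step(w, n, lambda w: w)  # e.g. E(w) = ||w||^2 / 2
```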


QUESTIONS

What exactly are $n$ and $t$?


References:

https://blogs.princeton.edu/imabandit/2014/03/06/nesterovs-accelerated-gradient-descent-for-smooth-and-strongly-convex-optimization/

http://ruder.io/optimizing-gradient-descent/

What's the difference between momentum based gradient descent, and Nesterov's accelerated gradient descent?

http://neuralnetworksanddeeplearning.com/chap2.html

https://brilliant.org/wiki/backpropagation/


1 Answer


I think you might be missing something here. I will use the first link you provided as a guide.

Nesterov's method has two steps: the normal gradient update, and then the "nudge", where we move the update a bit according to the update from the previous time step.

From what you write, $n$ should be the change in your weights, i.e., it is a vector, and $t$ is the iteration number. So it doesn't make sense to initialize $n$ to the scalar zero; the momentum is what we initialize to 0. Adapting the notation from Bubeck's article, and changing the definition of $n$:

$$ n_0 = 0, n_t = \frac{1 + \sqrt{1+4n_{t-1}^2}}{2}, \alpha_t=\frac{1-n_t}{n_{t+1}} $$

Then the update is done in two steps: first get the regular update according to the gradient, the previous weights, and the learning rate, then apply momentum to it:

$$y_{t+1} = w_t - \epsilon \frac{\partial E}{\partial w_t}$$

$$w_{t+1} = (1 - \alpha_t)\, y_{t+1} + \alpha_t\, y_t$$
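
As a rough sketch (not a reference implementation), the whole scheme in Python could look like the following; `E_grad`, `epsilon`, and `num_steps` are placeholder names, with `E_grad` standing for any function that returns $\frac{\partial E}{\partial w}$ at the given weights:

```python
import numpy as np

def nesterov_agd(w0, E_grad, epsilon, num_steps):
    """Nesterov's accelerated gradient, in the notation above."""
    w = np.asarray(w0, dtype=float)
    y = w.copy()
    n_prev = 0.0  # n_0 = 0; n is a scalar that only drives the momentum schedule
    for _ in range(num_steps):
        # momentum schedule: n_t, n_{t+1} and alpha_t from the recursion above
        n = (1.0 + np.sqrt(1.0 + 4.0 * n_prev**2)) / 2.0
        n_next = (1.0 + np.sqrt(1.0 + 4.0 * n**2)) / 2.0
        alpha = (1.0 - n) / n_next
        # step 1: the regular gradient update
        y_next = w - epsilon * E_grad(w)
        # step 2: the "nudge" that mixes in the previous intermediate weights
        w = (1.0 - alpha) * y_next + alpha * y
        y, n_prev = y_next, n
    return w

# example: minimize E(w) = ||w||^2 / 2, whose gradient is simply w
w_star = nesterov_agd([5.0, -3.0], lambda w: w, epsilon=0.1, num_steps=50)
```

Note that $\alpha_t$ is zero at the first iteration and negative afterwards, so the second step extrapolates past $y_{t+1}$ rather than averaging; that extrapolation is what distinguishes this from plain momentum.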

  • Awesome, would you please clarify what $\alpha_t$, $y_{t+1}$, and $w_{t+1}$ are? – KenobiBastila Oct 26 '17 at 16:27
  • $\alpha_t$ is the momentum, $y_{t+1}$ are the intermediate weights (before applying momentum), and $w_{t+1}$ are the updated weights, after momentum has been applied. – Bar Oct 26 '17 at 16:55
  • Alright, just a last question: $n$ is not clear. You said it's the change in the weights, but isn't that defined already by $w_{t+1}$? – KenobiBastila Oct 26 '17 at 17:26
  • So wait... in the first formula, should $n$ be $\alpha$? It's okay if we need to change the letters. Help please – KenobiBastila Oct 26 '17 at 17:33
  • $n$ and $\alpha$ are what you define them to be. I chose to use $n$ as you were using it as well. The way you defined it in the question wouldn't make sense, because you initialize it to a scalar, 0, and then assign a vector (the derivative) to it. Unless you assume that $w$ is a scalar as well, this does not hold. – Bar Oct 27 '17 at 09:37
  • I don't get it :O... In the equations you wrote, is $n$ the momentum or is it something else? – KenobiBastila Oct 27 '17 at 12:58
  • We need two variables to define the momentum, both $n$ and $\alpha$; the final momentum being applied is $\alpha$. You can think of $n$ as an intermediate step, similarly to $y$ and $w$. – Bar Oct 27 '17 at 13:25
  • So $n$ is the momentum in a first moment, and then $\alpha$ is the momentum in the second moment ? – KenobiBastila Oct 27 '17 at 14:05
  • I think you are confusing terminology here. Moments are characteristics of distributions, such as the mean and variance. Momentum in this context is the "speed" that the gradient has accumulated that we use to make it move more in the same direction. – Bar Oct 30 '17 at 09:05
  • Yes, I know, I'm just wondering if both of them are momentums. – KenobiBastila Oct 30 '17 at 10:50
  • In some sense yes, we just use two steps to calculate them to make writing them easier. – Bar Oct 30 '17 at 11:42
  • Hello @KenobiShan, I think I've answered this question to the extent necessary; I won't be able to help out more. You can open a new question if you feel you need more help. – Bar Oct 30 '17 at 14:04