1

I'm going through this deck but don't quite get the difference between momentum and Polyak averaging, and what role Polyak averaging plays in modern optimizers.

For example, is it correct to say that in momentum one averages parameter gradients while in Polyak we average parameter values?

From what I gather, Adam uses bias-corrected, running

  • averages of gradients (1st moment)
  • second-order moments of gradients

Has the use of Polyak averaging been studied in combination with Adam? In what cases is it expected to help?

Josh
  • 3,408
  • 4
  • 22
  • 46
  • Classic momentum works a little like EWMA, where farther in the past has (much) lower weight, and it is done each iteration. It looks like Polyak is about equal weights or at least not decayed weights over a window of successive updates. If momentum is 0.8, then parameter weight for 5 steps ago is $0.8^{5} = 0.32$. If all values were equal then 5 steps ago would be 1/6 = 0.16. If the past has lower weights, learning might be faster. – EngrStudent Jul 13 '20 at 22:40

0 Answers0