I'm going through this deck but don't quite get the difference between momentum and Polyak averaging, and what role Polyak averaging plays in modern optimizers.
For example, is it correct to say that momentum averages the gradients, whereas Polyak averaging averages the parameter values (the iterates) themselves?
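
To make my current understanding concrete, here is a minimal sketch of the two ideas on a toy problem. The objective f(x) = 0.5 * x**2, the function names, and the hyperparameter values are all my own assumptions for illustration and are not taken from the deck.

```python
def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * x**2 (assumed for illustration).
    return x


def sgd_momentum(x0, lr=0.1, beta=0.9, steps=100):
    # Heavy-ball momentum: the update direction is an exponentially
    # weighted running sum of past GRADIENTS.
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + grad(x)
        x = x - lr * v
    return x


def sgd_polyak(x0, lr=0.1, steps=100):
    # Plain gradient descent, plus a uniform running mean of the
    # PARAMETER VALUES (iterates). The average never feeds back
    # into the optimization; it is only read out at the end.
    x, x_avg = x0, x0
    for t in range(1, steps + 1):
        x = x - lr * grad(x)
        x_avg = x_avg + (x - x_avg) / (t + 1)  # mean of x_0 .. x_t
    return x, x_avg


x_mom = sgd_momentum(1.0)
x_last, x_avg = sgd_polyak(1.0)
```

If I have this right, the key structural difference is that the momentum buffer changes the trajectory, while the Polyak average is a passive summary of a trajectory that would have been the same without it.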
From what I gather, Adam uses bias-corrected running averages of:
- the gradients (first moment)
- the squared gradients (second raw moment)
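
To spell out what I mean by these two running averages, here is a scalar sketch of a single Adam step as I understand it. The function name `adam_step` is hypothetical, and the default hyperparameters are just commonly quoted values that I am assuming here.

```python
import math


def adam_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count; m and v start at 0.
    m = beta1 * m + (1 - beta1) * g        # running average of gradients
    v = beta2 * v + (1 - beta2) * g * g    # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


# First step from x = 1.0 with gradient g = 2.0.
x1, m1, v1 = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

Note that both buffers here are averages over gradients; nothing in this step averages the parameter values, which is why I am asking about Polyak averaging separately.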
Has the use of Polyak averaging been studied in combination with Adam? In what cases is it expected to help?