
My understanding is that:

  • With momentum, one can avoid e.g. "zig-zags" during gradient descent by averaging past gradients to determine a better direction of descent.
  • With adaptive step size methods like AdaGrad and RMSProp, one accumulates squared gradients and uses this accumulation to scale the step size per dimension up or down, e.g. to pay more attention to sparse features. (A minimal sketch of both update rules follows below.)
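
For concreteness, here is a minimal sketch of the two update rules being compared, assuming the common formulations of (heavy-ball style) momentum and RMSProp; the hyperparameter names (`lr`, `beta`, `rho`, `eps`) are just illustrative:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    # Accumulate an exponentially weighted sum of past gradients ("velocity")
    # and step along that averaged direction.
    velocity = beta * velocity + grad
    return theta - lr * velocity, velocity

def rmsprop_step(theta, grad, sq_avg, lr=0.01, rho=0.9, eps=1e-8):
    # Accumulate an exponentially weighted average of *squared* gradients
    # and rescale the step size per dimension by its square root.
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    return theta - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg

# One update on a toy 2-D parameter vector.
theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.1])
theta_m, v = momentum_step(theta, grad, velocity=np.zeros_like(theta))
theta_r, s = rmsprop_step(theta, grad, sq_avg=np.zeros_like(theta))
```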

On the surface, these methods seem to do different things (one changes the direction, the other the per-dimension step sizes), but aren't direction and per-dimension step size related concepts? Moreover, don't they both accumulate gradients?

  • For example, when we accumulate gradients to decide the step size for each dimension, isn't that mathematically equivalent to changing the direction of descent as well?

  • Or, put another way, couldn't one argue that momentum can be reinterpreted as a method that uses a different step size per dimension, based on the accumulated (averaged) gradient? (One way of writing this out is sketched after this list.)
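
To make that reinterpretation concrete, here is one way to write it out, assuming a standard momentum formulation (the notation $v_t$, $g_t$, $\beta$, $\eta$ is mine):

$$v_t = \beta v_{t-1} + g_t, \qquad \theta_t = \theta_{t-1} - \eta\, v_t.$$

Component-wise, whenever $g_{t,i} \neq 0$, this update can be read as

$$\theta_{t,i} = \theta_{t-1,i} - \left(\eta \frac{v_{t,i}}{g_{t,i}}\right) g_{t,i},$$

i.e. as an "effective step size" $\eta\, v_{t,i}/g_{t,i}$ applied per dimension to the current gradient. The question is whether this view makes momentum and adaptive methods two descriptions of the same underlying idea.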

Josh
