
When do "Ada" optimizers (e.g. Adagrad, Adam, etc.) "adapt" their parameters? Is it at the end of each mini-batch or at the end of each epoch?

1 Answer


They update their parameters after each mini-batch. (I use this term to avoid confusion with “batch gradient descent”; most neural network libraries talk about a ‘batch size’ when they mean ‘mini-batch size’.)
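For concreteness, here is a minimal PyTorch-style sketch; the model, data, and hyperparameters are placeholders, not anything specific. The point is that both the parameters and the optimizer's internal state change inside the inner loop, once per mini-batch:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data; names and shapes are made up for illustration.
model = nn.Linear(10, 1)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                    batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(3):                # the "epoch" exists only in this outer loop
    for inputs, targets in loader:    # one iteration = one mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()              # Adam updates its moment estimates and the
                                      # parameters here, once per mini-batch
```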

To help remember this: The optimizer has no notion of an ‘epoch’. For instance, for stochastic gradient descent, you could sample randomly from the dataset at every time step (rather than the common shuffle-and-iterate strategy) and it’ll still work. (It’s your job to define the training curriculum, not the optimizer’s.)

In that case, there’s no clearly defined ‘epoch’. Everything is in terms of which mini-batch the optimizer processes.
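A sketch of that "no epochs" setup, again with a placeholder model and data: mini-batches are drawn at random on every step, there is no epoch boundary anywhere, and the optimizer still adapts after each mini-batch.

```python
import torch
import torch.nn as nn

# Toy data and model; mini-batches are sampled at random each step,
# so no epoch boundary exists in this loop at all.
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):                       # just a step counter, no epochs
    idx = torch.randint(0, X.size(0), (32,))   # sample a random mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(X[idx]), y[idx])
    loss.backward()
    optimizer.step()                           # the update still happens once per mini-batch
```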

Arya McCarthy
  • Yes, I meant mini-batch; I'm editing the question now. It makes sense: in documentation I sometimes read "after each gradient update", and that happens after processing a mini-batch. – Marsellus Wallace Jun 03 '21 at 17:43