
When do "Ada" optimizers (e.g. Adagrad, Adam, etc.) "adapt" their parameters? Is it at the end of each mini-batch or at the end of each epoch?

1 Answer


They update their parameters after each mini-batch. (I use this term to avoid confusion with “batch gradient descent”; most neural network libraries talk about a ‘batch size’ when they mean ‘mini-batch size’.)
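For concreteness, here is a minimal PyTorch-style sketch; the model, data, and hyperparameters are placeholders, not anything specific. The point is that both the parameters and the optimizer's internal state change inside the inner loop, once per mini-batch:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model and data; names and shapes are made up for illustration.
model = nn.Linear(10, 1)
loader = DataLoader(TensorDataset(torch.randn(256, 10), torch.randn(256, 1)),
                    batch_size=32, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(3):                # the "epoch" exists only in this outer loop
    for inputs, targets in loader:    # one iteration = one mini-batch
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()              # Adam updates its moment estimates and the
                                      # parameters here, once per mini-batch
```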

To help remember this: The optimizer has no notion of an ‘epoch’. For instance, for stochastic gradient descent, you could sample randomly from the dataset at every time step (rather than the common shuffle-and-iterate strategy) and it’ll still work. (It’s your job to define the training curriculum, not the optimizer’s.)

In that case, there’s no clearly defined ‘epoch’. Everything is in terms of which mini-batch the optimizer processes.
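A sketch of that "no epochs" setup, again with a placeholder model and data: mini-batches are drawn at random on every step, there is no epoch boundary anywhere, and the optimizer still adapts after each mini-batch.

```python
import torch
import torch.nn as nn

# Toy data and model; mini-batches are sampled at random each step,
# so no epoch boundary exists in this loop at all.
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(1000):                       # just a step counter, no epochs
    idx = torch.randint(0, X.size(0), (32,))   # sample a random mini-batch
    optimizer.zero_grad()
    loss = loss_fn(model(X[idx]), y[idx])
    loss.backward()
    optimizer.step()                           # the update still happens once per mini-batch
```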

Arya McCarthy
  • Yes, I meant mini-batch; I'm editing the question now. It makes sense: in documentation I sometimes read "after each gradient update", and that happens after processing a mini-batch. – Marsellus Wallace Jun 03 '21 at 17:43