
Optimization algorithms like Adagrad and Adam decay the learning rate over time. To me this sounds like a bad idea for online training, since you are constantly receiving new data, as opposed to retraining on the same data for multiple epochs in the offline setting.

Supposing I could use Adagrad or Adam for online training, would the learning rate I find by grid search for offline training also be suitable for online training? I'd imagine not.


1 Answer


If I understood correctly, Adagrad decays the effective learning rate because it accumulates the outer products of all past gradients, $G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\mathsf{T}$ (in practice only the diagonal, i.e. the per-coordinate sums of squared gradients). Since these sums only grow, the per-coordinate step size $\eta / \sqrt{G_{t,ii}}$ keeps shrinking. In Adam, the analogous quantity is estimated with an exponential moving average instead, so old gradients are gradually forgotten and the step size does not decay toward zero. That moving-average idea seems to fit the context of online learning well, where new data keeps arriving.
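
To make the contrast concrete, here is a minimal sketch of the two per-coordinate update rules (diagonal Adagrad vs. Adam) in NumPy. The function names and default hyperparameters are illustrative, not taken from any particular library:

```python
import numpy as np

def adagrad_step(w, g, G, lr=0.01, eps=1e-8):
    """Diagonal Adagrad: G accumulates squared gradients forever,
    so the effective step size lr / sqrt(G) only ever shrinks."""
    G = G + g ** 2
    w = w - lr * g / (np.sqrt(G) + eps)
    return w, G

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: m and v are exponential moving averages of the gradient
    and squared gradient, so old gradients are forgotten and the
    effective step size does not decay to zero."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

In the Adagrad update the denominator is a running sum, so the step size is monotonically decreasing; in the Adam update it is a moving average, so the step size adapts to the recent gradient scale instead of vanishing, which is what makes it more natural for a stream of new data.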

dontloo