
I was wondering: are there cases where small or very small learning rates are useful in gradient-descent-based optimization?

A large learning rate allows the model to explore a much larger portion of the parameter space. With a small learning rate, on the other hand, the model can take a long time to converge.

In which cases are small learning rates particularly useful?
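To make the trade-off concrete, here is a minimal sketch of plain gradient descent on a one-dimensional quadratic; the objective and the particular learning rates are illustrative choices of mine, not part of the question itself:

```python
import numpy as np

def gradient_descent(lr, steps=100, x0=5.0):
    """Minimize f(x) = x^2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x          # plain gradient step
    return x

# Hypothetical learning rates chosen to show the three regimes.
for lr in [1.1, 0.4, 0.001]:
    print(f"lr={lr:>5}: x after 100 steps = {gradient_descent(lr):.3e}")
# lr=1.1   diverges (|x| blows up),
# lr=0.4   converges to ~0 within a handful of steps,
# lr=0.001 has barely moved after 100 steps (slow convergence).
```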

Samuel
  • If the learning rate is "too large," then the optimization can diverge. https://stats.stackexchange.com/questions/364360/how-can-change-in-cost-function-be-positive/364366#364366 A learning rate that is "small" in absolute terms might be the largest value that doesn't exhibit instability. – Sycorax Jul 15 '21 at 19:52
  • @Sycorax is right. Some food for thought is that [Smith 2017](https://arxiv.org/abs/1711.00489) suggests that increasing the batch size is preferable to decaying the learning rate, but perhaps increasing the batch size isn't feasible in situations where decreasing the learning rate (either a priori or via decay) is possible. – DifferentialPleiometry Jul 15 '21 at 20:04
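To illustrate the suggestion in the comment above, here is a rough sketch comparing the two schedules on a noisy quadratic; the `noisy_grad` helper, the objective, and the specific schedule values are hypothetical choices of mine, not taken from Smith et al. (2017):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_grad(x, batch_size):
    """Stochastic gradient of f(x) = x^2; noise shrinks as the batch grows."""
    return 2 * x + rng.normal(scale=1.0 / np.sqrt(batch_size))

def run(schedule, steps=200, x0=5.0):
    """schedule(t) -> (learning_rate, batch_size) at step t."""
    x = x0
    for t in range(steps):
        lr, bs = schedule(t)
        x = x - lr * noisy_grad(x, bs)
    return x

# (a) decay the learning rate, keep the batch size fixed
decay_lr   = lambda t: (0.1 * 0.99 ** t, 32)
# (b) keep the learning rate, grow the batch size instead
grow_batch = lambda t: (0.1, 32 * (1 + t // 50))

print("decay lr  :", run(decay_lr))
print("grow batch:", run(grow_batch))
```

Both schedules reduce the effective gradient noise as training proceeds; the second does so only if larger batches are actually feasible, which is the caveat raised in the comment.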

0 Answers