Are there cases where small or very small learning rates are particularly useful in gradient-descent-based optimization?
My understanding is that a large learning rate lets the model explore a much larger portion of the parameter space, while a small learning rate can mean the model takes a very long time to converge.
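To make what I mean concrete, here is a toy example I have in mind (purely illustrative, plain Python on a one-dimensional quadratic loss f(x) = x², not tied to any particular framework), comparing how many steps it takes to converge with different learning rates:

```python
def gradient_descent(lr, x0=10.0, tol=1e-6, max_steps=100_000):
    """Minimize f(x) = x^2 (gradient 2x); return the number of steps until the update is below tol."""
    x = x0
    for step in range(1, max_steps + 1):
        grad = 2 * x
        x_new = x - lr * grad
        if abs(x_new - x) < tol:  # stop when the parameter barely changes
            return step
        x = x_new
    return max_steps

# Smaller learning rates need many more steps on this simple problem.
for lr in (0.4, 0.01, 0.0001):
    print(f"lr={lr}: converged in {gradient_descent(lr)} steps")
```

On this toy loss the small learning rate needs tens of thousands of steps where the large one needs a handful, which is exactly the slowdown I am asking about.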
In which cases are small learning rates particularly useful?