In Michael Nielsen's online book Neural Networks and Deep Learning, chapter one (and onwards), he divides the learning rate, $\eta$, by the size of the mini-batch when he performs stochastic gradient descent (github link). Why?
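To make concrete what I am asking about, here is a minimal sketch of how I read the update step (my own paraphrase, not the code from the repo; `grad_fn` is a hypothetical stand-in for backprop on a single training example):

```python
import numpy as np

def update_mini_batch(weights, mini_batch, eta, grad_fn):
    """One SGD step as I understand the book's code: gradients are
    accumulated over the mini-batch, and eta is divided by the
    mini-batch size before the step is applied.

    grad_fn(weights, x, y) stands in for backprop and returns the
    gradient of the cost for the single example (x, y).
    """
    # Sum the per-example gradients over the mini-batch.
    nabla_w = [np.zeros(w.shape) for w in weights]
    for x, y in mini_batch:
        delta_nabla_w = grad_fn(weights, x, y)
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    # The step I am asking about: eta / len(mini_batch).
    return [w - (eta / len(mini_batch)) * nw
            for w, nw in zip(weights, nabla_w)]
```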
Effectively, this means that when a large mini-batch, e.g. all of the training data, is used for SGD, a very small learning step is taken. Conversely, if only a very small amount of data is used for SGD, then a very large learning step is taken. This seems very unintuitive to me.
There are some related questions on this, but they are more focussed on why increasing the batch size can have the same regularizing effect as lowering the learning rate.