
I am a newbie to deep-learning libraries and thus decided to go with Keras. While implementing an NN model, I saw the `batch_size` parameter in `model.fit()`.

Now, I was wondering: if I use the SGD optimizer and set `batch_size` to 1, m, or b (where m is the number of training examples and 1 < b < m), would I actually be implementing Stochastic, Batch, and Mini-Batch Gradient Descent, respectively? On the other hand, I felt that using SGD as the optimizer would ignore the `batch_size` parameter altogether, since SGD stands for Stochastic Gradient Descent and should always use a batch size of 1 (i.e. use a single data point for each iteration of gradient descent).

I would be grateful if someone could clarify as to which of the above two cases is true.

– Rajdeep Dutta

1 Answer


It works just as you suggest: the `batch_size` parameter does exactly what you would expect, i.e. it sets the size of the batch:

  • batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
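For instance, here is a minimal sketch (the toy model and data are my own, purely illustrative) showing that the same `SGD` optimizer covers all three regimes, driven only by the `batch_size` argument:

```python
import numpy as np
from tensorflow import keras

# Toy data: m = 100 training examples (hypothetical)
x_train = np.random.rand(100, 10)
y_train = np.random.rand(100, 1)

model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# Stochastic gradient descent: one sample per weight update
model.fit(x_train, y_train, batch_size=1, epochs=1)

# Mini-batch gradient descent: 1 < b < m samples per update
model.fit(x_train, y_train, batch_size=32, epochs=1)

# Batch gradient descent: the whole training set per update
model.fit(x_train, y_train, batch_size=len(x_train), epochs=1)
```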

From a programming point of view, Keras decouples the weight-update parameters specific to each optimizer (learning rate, momentum, etc.) from the global training properties (batch size, training length, etc.) that are shared between methods. It is a matter of convenience: there is no point in having separate SGD, MBGD, and BGD optimizers that all do the same thing, just with different batch sizes.
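To make that separation concrete, a hedged continuation of the toy setup above (the learning-rate and momentum values are arbitrary): update-rule hyperparameters live on the optimizer object passed to `model.compile()`, while the shared training properties go to `model.fit()`:

```python
from tensorflow import keras

# Optimizer-specific parameters (the weight-update formula) are
# configured on the optimizer object itself...
opt = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=opt, loss="mse")

# ...while global training properties such as batch size and the
# number of epochs are passed to fit(), regardless of which
# optimizer is in use.
model.fit(x_train, y_train, batch_size=64, epochs=10)
```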

– Jan Kukacka
  • Sorry, but how does this specify Gradient vs Stochastic Gradient versus mini-batch Gradient? A setting of 1 would be regular Gradient descent, right? A setting other than 1 would be Stochastic. There is no way to use mini-batch, as you are not saying "never use all samples". You are only saying "update after this many samples" – VISQL Oct 08 '19 at 15:27
  • 2
    If you specify `batch_size` to be the size of the whole dataset, you get *batch gradient descent* (i.e. nothing stochastic). Anything smaller is *mini-batch g.d.*, which is stochastic. Whether we call the special case of batch size = 1 something else, that's just a matter of nomenclature. – Jan Kukacka Oct 08 '19 at 16:04
  • @JanKukacka, thank you for clearing this up. I had to re-read a few things as well. – VISQL Oct 09 '19 at 09:05
  • @JanKukacka you should update your answer with your comment. – Viet Feb 08 '20 at 10:45