I have a feedforward neural net, trained on circa 34k samples and tested on 8k samples. There are 139 features in the dataset. The ANN classifies between two labels, 0 and 1, so I use a sigmoid activation in the last layer, with two hidden layers of 400 units each. The NN is built with the following Keras code:
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.callbacks import ModelCheckpoint, LambdaCallback

model = Sequential()
# Linear layer over the input; input_dim = number of feature columns (139)
model.add(Dense(units=139, input_dim=X.shape[1]))
model.add(Dense(units=400))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(units=400))
model.add(Activation('relu'))
model.add(Dropout(0.2))
# Single sigmoid unit for the binary output
model.add(Dense(units=1))
model.add(Activation('sigmoid'))

model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Save the model after every epoch
checkpoint = ModelCheckpoint('tempmodelKeras.h5', period=1)
# Evaluate on the test set after every epoch
custom = LambdaCallback(on_epoch_end=lambda epoch, logs: test_callback_wrapper())

model.fit(X, y, epochs=500, batch_size=128, callbacks=[checkpoint, custom])
test_callback_wrapper() just evaluates the model on the test dataset after each epoch and computes the average precision score over different thresholds.
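Roughly, it does something like the following sketch, where X_test and y_test are assumed to hold the held-out test split and average_precision_score comes from scikit-learn:

from sklearn.metrics import average_precision_score

ap_history = []

def test_callback_wrapper():
    # Predicted probabilities for the positive class on the test set
    y_scores = model.predict(X_test).ravel()
    # Average precision summarizes the precision-recall curve
    # over all decision thresholds
    ap = average_precision_score(y_test, y_scores)
    ap_history.append(ap)
    print('Test average precision: %.4f' % ap)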
Now the part I need help with: the following image shows the epoch number on the X axis and the average precision on the Y axis, for the test set.
I tried three different batch sizes (32, 128, 256). Looking at the plot, it seems that smaller batch sizes are 'faster' in terms of the number of epochs needed to reach the maximum average precision, but also more prone to overfitting. However, I have read several articles stating that larger batches usually lead to overfitting and that smaller batches generalize better. How is it possible that in my plot it looks the other way around, with larger batches performing better?
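For completeness, the three curves come from runs that differ only in batch_size, roughly along these lines (build_model() is a hypothetical helper that rebuilds the architecture above with fresh weights for each run):

for batch_size in (32, 128, 256):
    model = build_model()   # hypothetical helper: fresh, identically built model
    ap_history = []         # reset the per-epoch test scores for this run
    custom = LambdaCallback(on_epoch_end=lambda epoch, logs: test_callback_wrapper())
    model.fit(X, y, epochs=500, batch_size=batch_size, callbacks=[custom])
    # ap_history now holds one average precision value per epoch,
    # i.e. one of the three curves in the plot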