Let's say that I have 3,200 observations that I want to use for training a neural network model, and I want to set the batch size to 32. The number of minibatches used in every epoch for updating the weights is therefore $3200 / 32 = 100$.
How will my data be divided during minibatching? Assuming my datapoints are $x_{1}, x_{2}, \ldots, x_{3200}$ — will minibatching slice the data in its original order and reuse the same slices in every epoch? e.g. $$\begin{aligned} \text{Epoch 1:}&\quad \text{minibatch}_1 = [x_{1}, x_{2}, \ldots, x_{32}],\ \text{minibatch}_2 = [x_{33}, x_{34}, \ldots, x_{64}] \\ \text{Epoch 2:}&\quad \text{minibatch}_1 = [x_{1}, x_{2}, \ldots, x_{32}],\ \text{minibatch}_2 = [x_{33}, x_{34}, \ldots, x_{64}] \\ &\ \ \vdots \\ \text{Epoch 40:}&\quad \text{minibatch}_1 = [x_{1}, x_{2}, \ldots, x_{32}],\ \text{minibatch}_2 = [x_{33}, x_{34}, \ldots, x_{64}] \end{aligned}$$
Or are the slices reshuffled in every epoch, so that every minibatch contains a different combination of observations in each training epoch? e.g. $$\begin{aligned} \text{Epoch 1:}&\quad \text{minibatch}_1 = [x_{67}, x_{2891}, \ldots, x_{930}],\ \text{minibatch}_2 = [x_{102}, x_{7}, \ldots, x_{1241}] \\ \text{Epoch 2:}&\quad \text{minibatch}_1 = [x_{3174}, x_{15}, \ldots, x_{412}],\ \text{minibatch}_2 = [x_{753}, x_{2447}, \ldots, x_{1630}] \\ &\ \ \vdots \\ \text{Epoch 40:}&\quad \text{minibatch}_1 = [x_{456}, x_{73}, \ldots, x_{1984}],\ \text{minibatch}_2 = [x_{29}, x_{675}, \ldots, x_{2354}] \end{aligned}$$
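To make the two options concrete, here is a minimal NumPy sketch of what I mean; the `minibatches` helper and its `shuffle` flag are just illustrative names I made up, not any particular library's API:

```python
import numpy as np

def minibatches(X, batch_size, shuffle):
    """Yield minibatches of X; draw a fresh random order on each call if shuffle=True."""
    idx = np.arange(len(X))
    if shuffle:
        np.random.shuffle(idx)  # new random permutation for this epoch
    for start in range(0, len(X), batch_size):
        yield X[idx[start:start + batch_size]]

X = np.arange(3200)  # stand-ins for x_1, ..., x_3200
for epoch in range(40):
    for batch in minibatches(X, batch_size=32, shuffle=True):
        pass  # one weight update per 32-sample batch; 100 updates per epoch
```

In other words: does a typical training loop behave like `shuffle=False` (my first example) or `shuffle=True` (my second)? I know that framework loaders such as PyTorch's `DataLoader` expose a `shuffle` argument, but I'm asking what the standard or recommended behaviour is.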