With a small batch size, I believe the SGD descent direction becomes a very noisy estimate of the "true" descent direction (i.e. the one we would get by evaluating the gradient on the entire training set). With a small batch size, I am not sure how much increasing the momentum would help, since it would be accumulating momentum along very noisy directions. But I could be wrong; maybe your optimization problem is well-posed enough that this could work.
If you aren't gunning for "state of the art" results, one option you have for natural image data is to resize the images. I actually think that, modulo chasing elite benchmark performance, natural images have a lot of scale-invariant properties, and many of their semantic features are fairly robust under reasonable scaling transformations. This would relieve some of the GPU memory pressure, allowing you to increase your batch size so that your SGD descent directions become better estimates of the true descent direction.
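For example, a resize step can be slotted into the input pipeline; the sketch below assumes torchvision, and the 128x128 target size is an arbitrary placeholder:

```python
from torchvision import transforms

# Downscale inputs to trade some resolution for a larger feasible batch size.
# The 128x128 target is a placeholder; pick whatever preserves the features you care about.
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),   # resizes a PIL image (bilinear by default)
    transforms.ToTensor(),           # converts to a CHW float tensor in [0, 1]
])

# Usage: batch = torch.stack([preprocess(img) for img in pil_images])
```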
If you are dealing with a separable loss function like the negative log likelihood, we can exploit the fact that the gradient of a large batch is just the sum/average of the gradients of its constituent sub-batches. For example, if our batch size is $B$, we can compute the gradient of a super batch of size $BK$ by iterating through the batches as usual and computing each batch gradient, but instead of updating the weights, we cache each gradient into a running sum or average. If we average appropriately, we will be computing the exact gradient of the $BK$-sized super batch. We then perform the weight update after every $K$-th batch has been processed.
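To make the averaging explicit (notation mine: $\theta$ are the weights, $\mathcal{B}_k$ is the $k$-th sub-batch of size $B$, and $\ell$ is the per-example loss), linearity of the gradient gives

$$
\nabla_\theta \, \frac{1}{BK}\sum_{k=1}^{K}\sum_{x \in \mathcal{B}_k} \ell(x;\theta)
\;=\;
\frac{1}{K}\sum_{k=1}^{K}\underbrace{\frac{1}{B}\sum_{x \in \mathcal{B}_k} \nabla_\theta\, \ell(x;\theta)}_{\text{gradient of sub-batch }k},
$$

so averaging the $K$ cached sub-batch gradients recovers the exact super-batch gradient.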
By serializing the computation as described above, we compute the $BK$-sized batch gradient exactly. There is minimal extra computational or memory overhead: one only needs to modify the minibatch iterator to include the super batch serialization and the gradient cache.
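Here is a minimal sketch of this accumulation scheme in PyTorch, assuming the loss averages over the batch; the model, sizes, and optimizer settings are placeholders rather than anything from your setup:

```python
import torch
import torch.nn as nn

# Placeholder setup -- substitute your own model, loss, optimizer, and data loader.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()      # averages over the batch by default
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

B, K = 8, 4                          # sub-batch size and number of sub-batches per super batch
data = [(torch.randn(B, 10), torch.randint(0, 2, (B,))) for _ in range(20)]

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(data):
    # Compute the sub-batch gradient as usual, but scale the loss by 1/K so the
    # accumulated param.grad ends up being the average gradient over the B*K super batch.
    loss = loss_fn(model(inputs), targets) / K
    loss.backward()                  # backward() adds into param.grad, acting as the running sum

    # Perform the weight update only after every K-th sub-batch has been processed.
    if (i + 1) % K == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Scaling each sub-batch loss by $1/K$ is what turns the running sum into the super-batch average, so the effective learning rate stays comparable when you change $K$.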