With a small batch size, I believe the SGD descent direction becomes a very noisy estimate of the "true" descent direction (i.e. the one we would get by evaluating the gradient on the entire training set). With a small batch size, I am not sure how much increasing the momentum would help, since it would be accumulating momentum along very noisy directions. But I could be wrong; maybe your optimization problem is well-posed enough that this could work.
If you aren't gunning for "state of the art" results, one option you have for natural image data is to resize the images. I actually think that, modulo chasing elite benchmark performance, natural images have a lot of scale-invariant properties, and many of their semantic features are fairly robust under reasonable scaling transformations. This would relieve some of the GPU memory pressure, allowing you to increase your batch size so that your SGD descent directions become better estimates of the true descent direction.
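For example, a resize step can be slotted into the input pipeline; the sketch below assumes torchvision, and the 128x128 target size is an arbitrary placeholder:

```python
from torchvision import transforms

# Downscale inputs to trade some resolution for a larger feasible batch size.
# The 128x128 target is a placeholder; pick whatever preserves the features you care about.
preprocess = transforms.Compose([
    transforms.Resize((128, 128)),   # resizes a PIL image (bilinear by default)
    transforms.ToTensor(),           # converts to a CHW float tensor in [0, 1]
])

# Usage: batch = torch.stack([preprocess(img) for img in pil_images])
```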
If you are dealing with a separable loss function like the negative log likelihood, we can exploit the fact that the gradient of a large batch is just the sum/average of the gradients of its constituent sub-batches. For example, if our batch size is $B$, we can compute the gradient of a super batch of size $BK$ by iterating through the batches as usual and computing each batch gradient, but instead of updating the weights, we cache each gradient into a running sum or average. If we average appropriately, we will be computing the exact gradient of the $BK$-sized super batch. We then perform the weight update after every $K$-th batch has been processed.
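To make the averaging explicit (notation mine: $\theta$ are the weights, $\mathcal{B}_k$ is the $k$-th sub-batch of size $B$, and $\ell$ is the per-example loss), linearity of the gradient gives

$$
\nabla_\theta \, \frac{1}{BK}\sum_{k=1}^{K}\sum_{x \in \mathcal{B}_k} \ell(x;\theta)
\;=\;
\frac{1}{K}\sum_{k=1}^{K}\underbrace{\frac{1}{B}\sum_{x \in \mathcal{B}_k} \nabla_\theta\, \ell(x;\theta)}_{\text{gradient of sub-batch }k},
$$

so averaging the $K$ cached sub-batch gradients recovers the exact super-batch gradient.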
By serializing the computation as described above, we compute the $BK$-sized batch gradient exactly. There is minimal extra computational or memory overhead: one only needs to modify the minibatch iterator to include the super batch serialization and the gradient cache.
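Here is a minimal sketch of this accumulation scheme in PyTorch, assuming the loss averages over the batch; the model, sizes, and optimizer settings are placeholders rather than anything from your setup:

```python
import torch
import torch.nn as nn

# Placeholder setup -- substitute your own model, loss, optimizer, and data loader.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()      # averages over the batch by default
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

B, K = 8, 4                          # sub-batch size and number of sub-batches per super batch
data = [(torch.randn(B, 10), torch.randint(0, 2, (B,))) for _ in range(20)]

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(data):
    # Compute the sub-batch gradient as usual, but scale the loss by 1/K so the
    # accumulated param.grad ends up being the average gradient over the B*K super batch.
    loss = loss_fn(model(inputs), targets) / K
    loss.backward()                  # backward() adds into param.grad, acting as the running sum

    # Perform the weight update only after every K-th sub-batch has been processed.
    if (i + 1) % K == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Scaling each sub-batch loss by $1/K$ is what turns the running sum into the super-batch average, so the effective learning rate stays comparable when you change $K$.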