270

I'm using the Python Keras package for neural networks. This is the link. Is batch_size equal to the number of test samples? From Wikipedia we have this information:

However, in other cases, evaluating the sum-gradient may require expensive evaluations of the gradients from all summand functions. When the training set is enormous and no simple formulas exist, evaluating the sums of gradients becomes very expensive, because evaluating the gradient requires evaluating all the summand functions' gradients. To economize on the computational cost at every iteration, stochastic gradient descent samples a subset of summand functions at every step. This is very effective in the case of large-scale machine learning problems.

Is the above information describing test data? Is this the same as batch_size in Keras (number of samples per gradient update)?

pasbi
  • 105
  • 4
user2991243
  • 3,621
  • 4
  • 22
  • 48
  • 3
    It's good to see the https://class.coursera.org/ml-005/lecture/preview course, especially weeks 4-6 + 10 for you. Wikipedia may not be such a valuable resource for learning neural networks. – 404pio May 22 '15 at 10:48

6 Answers

372

The batch size defines the number of samples that will be propagated through the network.

For instance, let's say you have 1050 training samples and you want to set up a batch_size equal to 100. The algorithm takes the first 100 samples (from 1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (from 101st to 200th) and trains the network again. We can keep doing this procedure until we have propagated all samples through the network. A problem might arise with the last set of samples: in our example, we've used 1050, which is not divisible by 100 without remainder. The simplest solution is just to take the final 50 samples and train the network on them.
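For illustration, here is a minimal sketch of that slicing (my own toy code in plain NumPy, not Keras internals; the 10 features are an arbitrary choice):

```python
import numpy as np

# Toy data standing in for the 1050 training samples.
X = np.random.rand(1050, 10)
batch_size = 100

for start in range(0, len(X), batch_size):
    batch = X[start:start + batch_size]  # slicing past the end just yields a shorter batch
    print(start, batch.shape)            # the last batch has shape (50, 10)
```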

Advantages of using a batch size < number of all samples:

  • It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That's especially important if you are not able to fit the whole dataset in your machine's memory.

  • Typically networks train faster with mini-batches. That's because we update the weights after each propagation. In our example we've propagated 11 batches (10 of them had 100 samples and 1 had 50 samples), and after each of them we've updated our network's parameters. If we used all samples during propagation, we would make only 1 update to the network's parameters.

Disadvantages of using a batch size < number of all samples:

  • The smaller the batch the less accurate the estimate of the gradient will be. In the figure below, you can see that the direction of the mini-batch gradient (green color) fluctuates much more in comparison to the direction of the full batch gradient (blue color).

Gradient directions for different batch setups

Stochastic gradient descent is just a mini-batch with batch_size equal to 1. In that case, the gradient changes its direction even more often than a mini-batch gradient.
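To see that fluctuation numerically rather than graphically, here is a small toy experiment (my own sketch with a simple squared loss on synthetic data, not part of the original answer) showing that the spread of the batch-averaged gradient shrinks as the batch grows:

```python
import numpy as np

# Synthetic 1-D regression data; the loss per sample is (w*x - y)^2.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.5, size=x.size)

w = 0.0                                # current parameter guess
per_sample_grad = 2 * (w * x - y) * x  # gradient of each summand at w

for batch_size in (1, 10, 100, 1000):
    usable = (x.size // batch_size) * batch_size
    batch_grads = per_sample_grad[:usable].reshape(-1, batch_size).mean(axis=1)
    # The standard deviation of the batch estimates drops roughly as 1/sqrt(batch_size),
    # i.e. larger batches point more consistently in the full-batch direction.
    print(batch_size, batch_grads.std())
```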

itdxer
  • 7,104
  • 1
  • 18
  • 28
  • Thank you for the answer. Do you work with `Keras`? Is there any way to set test data in this package? – user2991243 May 22 '15 at 09:50
  • 3
    No, I didn't. This is a popular technique in neural networks, and you can see this terminology in different libraries, books and articles. Do you want to check the test data error in every epoch or just verify the model after training? – itdxer May 22 '15 at 09:55
  • Yes. That's true. We have a similar structure in `MATLAB`, but here I found only train and validation datasets. I think in this package the validation dataset is the same as test data, but there isn't early stopping, so we don't have any real validation data. – user2991243 May 22 '15 at 09:57
  • 1
    The network also converges faster as the number of updates is considerably higher. Setting up the mini-batch size is kind of an art: too small and you risk making your learning too stochastic (faster, but it will converge to unreliable models); too big and it won't fit into memory and will still take ages. – Ramalho May 23 '15 at 00:06
  • 1
    Does this mean that `batch_size=` are considered online learning, or rather `batch_size=1`? And does all of this remain true for RNNs as well? When using `batch_size` in RNNs, is the batch considered a sort of _virtual timestep_ in that all the instances in that batch will be computed as if they occurred at once? – ehiller Aug 07 '17 at 16:24
  • 3
    Typically when people say online learning they mean `batch_size=1`. The idea behind online learning is that you update your model as soon as you see the example. With a larger batch size it means that you first look through multiple samples before doing an update. In RNNs the size of the batch can have different meanings. Usually, it's common to split the training sequence into windows of a fixed size (like 10 words). In this case, including 100 of these windows during training will mean that you have `batch_size=100`. – itdxer Aug 07 '17 at 18:29
  • @itdxer: "The problem usually happens with the last set of samples." What exactly is the problem? So, the last batch carries 50 samples, but is designed to carry 100. I don't see a problem here, besides the small nuisance of a half-wasted batch in the last step only. What am I missing? – Oleg Melnikov Dec 26 '17 at 17:27
  • 2
    @Oleg Melnikov, if your last batch has a significantly smaller size (let's say it would be 1 instead of 50) then the estimate for the gradient would be less accurate and it can screw up your weights a bit. In the image above, imagine that you make 10 updates with a mini-batch of 100 (green lines) and one with a mini-batch of 1 (red line). This means that in the next epoch the first few iterations can start by solving the problem left by the last mini-batch-of-1 update from the previous epoch. – itdxer Dec 26 '17 at 19:57
  • @itdxer. Why would the gradient be less accurate? It seems you may be assuming that in the implementation of Keras and TF the last batch would be padded with some noise that would erode the gradient. Is that so? Anyhow, something to ponder upon ;) – Oleg Melnikov Dec 26 '17 at 20:48
  • 1
    @OlegMelnikov MIT deep learning book has a good explanation related to this problem (chapter 8.1.3): http://www.deeplearningbook.org/contents/optimization.html – itdxer Dec 26 '17 at 21:16
  • Sounds like this answer is incorrect or confusing. From what I know, batch size is the number of items from the dataset it takes to trigger the weight adjustment. So if you use batch size 1, you update weights after every sample. If you use batch size 10, you calculate the average error and then update weights every 10 samples. – Alexus Mar 29 '18 at 20:08
  • Batch is commonly used as terminology for a number of training samples, but training is not required in order to call it a batch. If you have a database with 100M entities that you want to classify, you will still have to split it into batches and do your prediction per batch (even if you want to distribute it across many machines). In fact, many libraries use the batch size terminology for these cases (you can check the Keras docs). With batch size 10 you propagate all 10 examples at the same time, but the gradient will be calculated from the average error, since it's more efficient. – itdxer Mar 30 '18 at 08:22
  • Yet another advantage of mini-batch gradient descent is that it can jump out of local minima, if the cost function is not convex. So the disadvantage mentioned in the answer may actually be an advantage in these scenarios. – flow2k Jul 19 '19 at 22:31
222

In the neural network terminology:

  • one epoch = one forward pass and one backward pass of all the training examples
  • batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
  • number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
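As a quick sanity check of that arithmetic (a trivial sketch of my own, not library code):

```python
import math

n_examples = 1000
batch_size = 500
iterations_per_epoch = math.ceil(n_examples / batch_size)
print(iterations_per_epoch)  # 2  (the ceiling handles a smaller final batch)
```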

FYI: Tradeoff batch size vs. number of iterations to train a neural network

Franck Dernoncourt
  • 42,093
  • 30
  • 155
  • 271
  • But what's the difference between using [batch size] examples at once versus training the network on each example and then proceeding with the next [batch size] examples? If you pass one example through the network, apply SGD, take the next example and so on, it would make no difference whether the batch size is 10 or 1000 or 100000: after [batch size] examples are done, the next example of the next batch follows. It would only make a difference if [batch size] examples passed through the network [number of iterations] times before proceeding with the next [batch size] examples. – Erhard Dinhobl Mar 26 '17 at 08:31
  • An important distinction is that the learning step (one step) is applied once for every batch, while you have to cycle through all batches to make one epoch. So the difference is also algorithmic not only in memory: bigger batches mean that you average the gradient over more samples. – meduz Sep 07 '18 at 07:19
  • 1
    What's the difference between epoch and iterations? – JobHunter69 Jun 09 '19 at 20:37
  • 2
    @Goldname 1 epoch includes all the training examples whereas 1 iteration includes only [batch size] number of training examples. – Franck Dernoncourt Jun 09 '19 at 20:39
  • So: If the number of epochs is constant, and we're plotting the convergence plot with each point representing the result after each epoch, we can expect the resulting plot to be 'smoother' (and the training to be slower) as we decrease `batch_size`? – Itamar Mushkin Nov 21 '19 at 13:04
9

The question was asked a while ago, but I think people are still stumbling across it. For me, it helped to know the mathematical background to understand batching and where the advantages/disadvantages mentioned in itdxer's answer come from. So please take this as a complementary explanation to the accepted answer.

Consider Gradient Descent as an optimization algorithm to minimize your Loss function $J(\theta)$. The updating step in Gradient Descent is given by

$$\theta_{k+1} = \theta_{k} - \alpha \nabla J(\theta)$$

For simplicity let's assume you only have 1 parameter ($n=1$), but you have a total of 1050 training samples ($m = 1050$) as suggested by itdxer.

Full-Batch Gradient Descent

In Batch Gradient Descent one computes the gradient for a batch of training samples first (represented by the sum in the equation below; here the batch comprises all $m$ samples, i.e. the full batch) and then updates the parameter:

$$\theta_{k+1} = \theta_{k} - \alpha \sum^m_{j=1} \nabla J_j(\theta)$$

This is what is described in the Wikipedia excerpt from the OP. For a large number of training samples, the updating step becomes very expensive since the gradient has to be evaluated for each summand.

Stochastic Gradient Descent

In Stochastic Gradient Descent one computes the gradient for one training sample and updates the parameter immediately. These two steps are repeated for all training samples.

for each sample j compute:

$$\theta_{k+1} = \theta_{k} - \alpha \nabla J_j(\theta)$$

One updating step is less expensive since the gradient is only evaluated for a single training sample j.
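The two update rules can be written out in a few lines of Python. The following is my own toy setup (one parameter, a squared loss, synthetic data with true slope 2.0), not code from the answer, but it follows the equations above: full-batch sums the per-sample gradients before one update per epoch, while SGD updates after every sample.

```python
import numpy as np

# Synthetic data: m = 1050 samples, one parameter theta, true slope 2.0.
# J_j(theta) = (theta * x_j - y_j)^2, so grad J_j(theta) = 2 * (theta * x_j - y_j) * x_j.
rng = np.random.default_rng(42)
x = rng.normal(size=1050)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)
alpha = 0.0005  # step size, hand-picked for this toy problem

def grad_j(theta, j):
    return 2 * (theta * x[j] - y[j]) * x[j]

# Full-batch gradient descent: sum all per-sample gradients, then one update per epoch.
theta_batch = 0.0
for epoch in range(10):
    theta_batch -= alpha * sum(grad_j(theta_batch, j) for j in range(x.size))

# Stochastic gradient descent: update immediately after each sample.
theta_sgd = 0.0
for epoch in range(10):
    for j in range(x.size):
        theta_sgd -= alpha * grad_j(theta_sgd, j)

print(theta_batch, theta_sgd)  # both end up close to the true slope 2.0
```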

Difference between both approaches

Updating Speed: Batch gradient descent tends to converge more slowly because the gradient has to be computed for all training samples before updating. Within the same number of computation steps, Stochastic Gradient Descent will already have updated the parameter multiple times. But why should we then even choose Batch Gradient Descent?

Convergence Direction: The faster updating speed comes at the cost of lower "accuracy". Since in Stochastic Gradient Descent we only incorporate a single training sample to estimate the gradient, it does not converge as directly as Batch Gradient Descent. One could say that the amount of information in each updating step is lower in SGD compared to BGD.

The less direct convergence is nicely depicted in itdxer's answer. Full-batch has the most direct route of convergence, whereas mini-batch or stochastic fluctuates a lot more. Also, with SGD it can theoretically happen that the solution never fully converges.

Memory Capacity: As pointed out by itdxer, feeding training samples in batches requires memory capacity to load the batches. The greater the batch, the more memory capacity is required.

Summary

In my example I used Gradient Descent and no particular loss function, but the concept stays the same since optimization on computers basically always comprises iterative approaches.

So, by batching you have influence over training speed (smaller batch size) vs. gradient estimation accuracy (larger batch size). By choosing the batch size you define how many training samples are combined to estimate the gradient before updating the parameter(s).

Arya McCarthy
  • 6,390
  • 1
  • 16
  • 47
Wellenprinz
  • 91
  • 1
  • 2
  • Thanks a lot, I was still confused about why having different batches impacts our weights, and your explanation of the two versions of gradient descent made it crystal clear. – Mohsen Sichani Dec 14 '21 at 08:05
7

When solving an Optimization Problem with a CPU or a GPU, you iteratively apply an Algorithm over some Input Data. In each of these iterations you usually update a Metric of your problem by doing some Calculations on the Data. Now when the size of your data is large, it might need a considerable amount of time to complete every iteration, and it may consume a lot of resources. So sometimes you choose to apply these iterative calculations on a Portion of the Data to save time and computational resources. This portion is the batch_size and the process is called (in the Neural Network Lingo) batch data processing. When you apply your computations on all your data, then you do online data processing. I guess the terminology comes from the 60s, and even before. Does anyone remember the .bat DOS files? But of course the concept incarnated to mean a thread or portion of the data to be used.

pebox11
  • 199
  • 3
  • 4
  • It is more precise to call it "mini-batch processing", since "batch processing" refers to using the entire dataset, not a portion of it. – little_monster Jul 07 '21 at 03:32
5

The documentation for Keras regarding batch size can be found under the fit function on the Models (functional API) page:

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.

If you have a small dataset, it would be best to make the batch size equal to the size of the training data. First try with a small batch, then increase it to save time. As itdxer mentioned, there's a tradeoff between accuracy and speed.
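For completeness, a minimal usage sketch (hypothetical data shapes and layer sizes, written against the tf.keras API) showing where batch_size enters:

```python
import numpy as np
from tensorflow import keras

# Hypothetical data: 1050 samples with 10 features, one regression target each.
x_train = np.random.rand(1050, 10)
y_train = np.random.rand(1050, 1)

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")

# batch_size=100 -> 11 gradient updates per epoch
# (10 full batches plus one final batch of 50 samples).
model.fit(x_train, y_train, epochs=5, batch_size=100)
```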

otayeby
  • 159
  • 1
  • 2
0

Batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.