It seems that batch gradient descent is just traditional gradient descent, except that the objective function is written as a summation?
Does this answer your question? [Gradient Descent (GD) vs Stochastic Gradient Descent (SGD)](https://stats.stackexchange.com/questions/317675/gradient-descent-gd-vs-stochastic-gradient-descent-sgd) – Arya McCarthy Mar 05 '22 at 04:17
@AryaMcCarthy No, I'm not asking about stochastic gradient descent. – serity Mar 05 '22 at 04:37
2 Answers
Batch gradient descent, stochastic gradient descent, and mini-batch gradient descent are all variants of gradient descent. The difference between them is how many examples are used for a single parameter update (see the sketch after the list):
- Batch GD: use all examples
- Mini-batch GD: use a small random subset (a mini-batch) of examples
- Stochastic GD: use only one example
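Here is a minimal sketch of the three update rules on a made-up least-squares problem (my own illustration with NumPy, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # 1000 examples, 5 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    """Gradient of the mean squared error on the examples (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(5)
lr = 0.1

# Batch GD: one update uses all 1000 examples.
w_batch = w - lr * grad(w, X, y)

# Mini-batch GD: one update uses a small random batch (here 32 examples).
idx = rng.choice(len(y), size=32, replace=False)
w_minibatch = w - lr * grad(w, X[idx], y[idx])

# Stochastic GD: one update uses a single randomly chosen example.
i = rng.integers(len(y))
w_sgd = w - lr * grad(w, X[i:i+1], y[i:i+1])
```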

Gradient descent uses, at each iteration, all of your data to compute the gradient of the log-likelihood it is maximizing, i.e. at every step it works with the actual function that is to be optimized, the log-likelihood. This is the most standard optimization procedure for a continuous domain and range, and there is nothing stochastic (random) about it.
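In symbols (my restatement, with $N$ the number of data points and $\eta$ a step size; the notation is not the answerer's): one full-batch ascent step is

$$\theta_{t+1} = \theta_t + \eta\, \nabla_\theta \sum_{i=1}^{N} \log p(x_i \mid \theta_t),$$

where the sum over all $N$ examples is recomputed exactly at every iteration, which is why nothing about the procedure is random.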
Batch gradient descent does not take all of your data; at each step it uses only some new, randomly chosen subset (the "batch") of it. Thus, at each step, the gradient is taken of a different function than the actual objective (the log-likelihood in our case). Different batches result in different functions, and thus in different gradients at the same parameter vector.
Now, most of the time those batches are chosen via some kind of random procedure, which makes the gradient computed at each step random, i.e. stochastic. That is why it is called stochastic gradient descent (SGD).
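A small sketch of that point (my own example, again a least-squares objective): at a fixed parameter vector the full-data gradient is a single deterministic vector, while two random batches generally yield two different gradient estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)
w = np.zeros(3)

def grad(w, Xb, yb):
    """Gradient of the mean squared error on the examples (Xb, yb)."""
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(w, X, y)                           # deterministic: always the same
b1 = rng.choice(500, size=32, replace=False)   # one random batch
b2 = rng.choice(500, size=32, replace=False)   # another random batch
print(grad(w, X[b1], y[b1]))                   # stochastic estimate no. 1
print(grad(w, X[b2], y[b2]))                   # stochastic estimate no. 2, differs
print(full)
```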
Doing "batch gradient descent" without any randomness in the choice of the batches is not recommended, it will usually lead to bad results.
Some people refer to online learning as "batch gradient descent", where they use, new batches from a datastream only once, and then throw it away. But this can also be understood as SGD, provided the data stream is not containing some weird regularity.
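For concreteness, a sketch of that single-pass, streaming regime (all names here, including the `stream` generator, are hypothetical and only for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([1.5, -0.5])

def stream(n):
    """Hypothetical data stream: yields one (x, y) pair at a time."""
    for _ in range(n):
        x = rng.normal(size=2)
        yield x, x @ true_w + 0.1 * rng.normal()

w = np.zeros(2)
lr = 0.05
for x, y in stream(5000):
    g = 2.0 * (x @ w - y) * x   # gradient on this single example only
    w -= lr * g                 # update, then the example is discarded

print(w)  # close to true_w if the stream is i.i.d. (no "weird regularity")
```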
