
When using stochastic gradient descent, how do we pick a stopping criterion?

A benefit of stochastic gradient descent is that, since it is stochastic, it can avoid getting stuck in a suboptimal region.

So I don't understand how we can pick a stopping criterion, because surely a fixed criterion would limit the algorithm's ability to get itself "unstuck"?

For gradient descent, I would typically use the norm of the gradient as a stopping criterion (so we stop when that norm is small enough).

Would this make sense for SGD too?
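For reference, here is roughly what I mean for plain (full-batch) gradient descent; the toy least-squares objective, tolerance, and step size are just arbitrary choices for illustration:

```python
import numpy as np

# Toy least-squares objective f(w) = 0.5 * ||A w - b||^2 (illustrative only).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = rng.normal(size=100)

def grad(w):
    return A.T @ (A @ w - b)

w = np.zeros(10)
lr = 1e-3          # step size
tol = 1e-6         # stop when the gradient norm falls below this threshold
max_iters = 10_000

for it in range(max_iters):
    g = grad(w)
    if np.linalg.norm(g) < tol:   # stopping criterion: gradient norm is small enough
        break
    w -= lr * g
```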

aha

1 Answer


We don't stop SGD because the loss stops dropping; we usually stop as a means of preventing overfitting.

In this sense, we don't actually want SGD to reach the minimum of the training loss, because this would entail memorizing the training set, which in turn reduces generalization.

As such, for a stopping criterion, we usually monitor the validation loss. This means that at the end of each epoch, we make a pass through the validation set and measure the validation loss. If we see that it stops dropping for a few epochs, we stop training (this is known as early stopping). Note that the training loss could still be dropping at that point.
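As a rough sketch, early stopping based on validation loss might look like this. It's a toy linear-regression example trained with minibatch SGD; the data, learning rate, and patience value are illustrative choices, not from any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data, split into training and validation sets (illustrative only).
X = rng.normal(size=(1000, 20))
true_w = rng.normal(size=20)
y = X @ true_w + 0.5 * rng.normal(size=1000)
X_train, y_train = X[:800], y[:800]
X_val, y_val = X[800:], y[800:]

w = np.zeros(20)
lr, batch_size, max_epochs = 0.01, 32, 200
patience = 5                       # epochs to wait without validation improvement
best_val_loss = np.inf
epochs_without_improvement = 0

for epoch in range(max_epochs):
    # One epoch of minibatch SGD over the training set.
    idx = rng.permutation(len(X_train))
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X_train[batch], y_train[batch]
        g = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= lr * g

    # Measure the validation loss at the end of the epoch.
    val_loss = np.mean((X_val @ w - y_val) ** 2)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_w = w.copy()          # keep the best parameters seen so far
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                  # early stopping: validation loss stopped improving
```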

Djib2011