I am using Stochastic Gradient Descent with AdaGrad.
My training set has 1.6 billion examples. After about 30 million examples, the training loss reaches a low and then starts increasing.
The examples in the training set are ordered temporally. What should I do so that the algorithm doesn't diverge?
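For reference, this is roughly the update I am running (a simplified numpy sketch, not my actual code; `grad_fn` and the example iterator are placeholders):

```python
import numpy as np

def adagrad_sgd(w, examples, grad_fn, lr=0.01, eps=1e-8):
    """SGD where each parameter's step is scaled by the inverse square root
    of its accumulated squared gradients (the AdaGrad rule)."""
    g_accum = np.zeros_like(w)                   # running sum of squared gradients
    for x, y in examples:                        # examples arrive in temporal order
        g = grad_fn(w, x, y)                     # gradient of the loss on this example
        g_accum += g * g                         # accumulate squared gradient
        w -= lr * g / (np.sqrt(g_accum) + eps)   # per-parameter scaled step
    return w
```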
Things I am considering:
- Lowering the learning rate
- Line search
- Early stopping
- AdaDelta (see the sketch after the question below)
Which one should I pursue?
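For context, the AdaDelta variant I am considering would replace AdaGrad's ever-growing gradient accumulator with decaying averages, roughly like this (a sketch based on Zeiler's 2012 paper; variable names are my own, not from any particular library):

```python
import numpy as np

def adadelta_step(w, g, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update: the squared-gradient accumulator decays,
    so the effective step size does not shrink toward zero as in AdaGrad."""
    Eg2 = rho * Eg2 + (1 - rho) * g * g                   # decaying average of g^2
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * g  # scaled update
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx               # decaying average of dx^2
    return w + dx, Eg2, Edx2
```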