I am using Stochastic Gradient Descent with AdaGrad.
My training set has 1.6 billion examples. After about 30 million examples, the training loss reaches a low and then starts increasing.
The examples in the training set are ordered temporally. What should I do so that the algorithm doesn't diverge?
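For reference, this is roughly the update I am running (a simplified numpy sketch, not my actual code; `grad_fn` and the example iterator are placeholders):

```python
import numpy as np

def adagrad_sgd(w, examples, grad_fn, lr=0.01, eps=1e-8):
    """SGD where each parameter's step is scaled by the inverse square root
    of its accumulated squared gradients (the AdaGrad rule)."""
    g_accum = np.zeros_like(w)                   # running sum of squared gradients
    for x, y in examples:                        # examples arrive in temporal order
        g = grad_fn(w, x, y)                     # gradient of the loss on this example
        g_accum += g * g                         # accumulate squared gradient
        w -= lr * g / (np.sqrt(g_accum) + eps)   # per-parameter scaled step
    return w
```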
Things I am considering:
- Lowering the learning rate
- Line search
- Early stopping
- AdaDelta (see the sketch after the question below)
Which one should I pursue?
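For context, the AdaDelta variant I am considering would replace AdaGrad's ever-growing gradient accumulator with decaying averages, roughly like this (a sketch based on Zeiler's 2012 paper; variable names are my own, not from any particular library):

```python
import numpy as np

def adadelta_step(w, g, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update: the squared-gradient accumulator decays,
    so the effective step size does not shrink toward zero as in AdaGrad."""
    Eg2 = rho * Eg2 + (1 - rho) * g * g                   # decaying average of g^2
    dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * g  # scaled update
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx               # decaying average of dx^2
    return w + dx, Eg2, Edx2
```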