Adam optimizer with exponential decay

Question

In most TensorFlow code I have seen, the Adam optimizer is used with a constant learning rate of 1e-4 (i.e. 0.0001). The code usually looks like the following:

...build the model...
# Add the optimizer
train_op = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
# Add the ops to initialize variables.  These will include 
# the optimizer slots added by AdamOptimizer().
init_op = tf.initialize_all_variables()

# launch the graph in a session
sess = tf.Session()
# Actually initialize the variables
sess.run(init_op)
# now train your model
for ...:
  sess.run(train_op)

I am wondering whether it is useful to use exponential decay with the Adam optimizer, i.e. to use the following code:

...build the model...
# Add the optimizer
step = tf.Variable(0, trainable=False)
rate = tf.train.exponential_decay(0.15, step, 1, 0.9999)
train_op = tf.train.AdamOptimizer(rate).minimize(cross_entropy, global_step=step)
# Add the ops to initialize variables.  These will include 
# the optimizer slots added by AdamOptimizer().
init_op = tf.initialize_all_variables()

# launch the graph in a session
sess = tf.Session()
# Actually initialize the variables
sess.run(init_op)
# now train your model
for ...:
  sess.run(train_op)

Usually people use some kind of learning rate decay, but for Adam it seems uncommon. Is there any theoretical reason for this? Can it be useful to combine the Adam optimizer with decay?

How do you get the step Variable to update with every iteration? — perrohunter, May 03 '16 at 20:05
@perrohunter: Use the `global_step` parameter of `minimize`. See edit. — Charles Staats, Jul 30 '16 at 19:01
I see you assign "global_step=step" but I don't see how the "step" variable is being updated... can you clarify please? — Diego, Dec 01 '16 at 10:36
@Diego: late answer but: passing the step variable to minimize as its global_step parameter makes the minimize op increase the step variable each time the training op is run. See the documentation for minimize. Do note that this means that when doing mini-batches, the step variable is updated for each mini-batch, not just for each epoch. — dimpol, Mar 30 '17 at 08:28
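
For reference, the schedule that `tf.train.exponential_decay(0.15, step, 1, 0.9999)` describes can be reproduced in plain Python. This is only a minimal sketch, assuming the documented formula `learning_rate * decay_rate ^ (global_step / decay_steps)` in its non-staircase form; the helper name `exponential_decay_rate` is made up for illustration and is not part of TensorFlow.

```python
def exponential_decay_rate(initial_rate, global_step, decay_steps, decay_rate):
    """Learning rate after `global_step` optimizer steps (hypothetical helper).

    Assumes the continuous (non-staircase) form:
        initial_rate * decay_rate ** (global_step / decay_steps)
    """
    return initial_rate * decay_rate ** (global_step / decay_steps)


# Values from the question: start at 0.15, multiply by 0.9999 on every step.
# Each mini-batch counts as one step, as noted in the comments.
for step in (0, 1000, 10000, 50000):
    print(step, exponential_decay_rate(0.15, step, 1, 0.9999))
```

Because the step counter advances once per mini-batch, how fast the rate shrinks depends on the number of batches per epoch, not on the number of epochs.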

Indie AI · Accepted Answer · 2016-03-05T16:17:54.883

Empirically speaking: definitely try it out, you may find some very useful training heuristics, in which case, please do share!

Usually people use some kind of decay, for Adam it seems uncommon. Is there any theoretical reason for this? Can it be useful to combine Adam optimizer with decay?

I haven't seen enough people's code using the Adam optimizer to say whether this is true. If it is, perhaps it's because Adam is relatively new and learning rate decay "best practices" haven't been established yet.

I do want to note, however, that learning rate decay is actually part of the theoretical guarantee for Adam. Specifically, in Theorem 4.1 of the Adam authors' ICLR article, one of the hypotheses is that the learning rate has a square root decay, $\alpha_t = \alpha/\sqrt{t}$. Furthermore, for their logistic regression experiments they use the square root decay as well.
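
To make the square root decay concrete, here is a small plain-Python sketch of $\alpha_t = \alpha/\sqrt{t}$. The function name `sqrt_decay` and the value 0.001 are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def sqrt_decay(alpha, t):
    """Step size alpha_t = alpha / sqrt(t) for timestep t >= 1."""
    if t < 1:
        raise ValueError("t must be a positive timestep")
    return alpha / math.sqrt(t)


# alpha = 0.001 is an illustrative value, not one taken from the paper.
for t in (1, 100, 10000):
    print(t, sqrt_decay(0.001, t))
```

This decays much more slowly than a typical exponential schedule: the step size only drops by a factor of 10 for every factor of 100 in the timestep.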

Simply put: I don't think anything in the theory discourages using learning rate decay rules with Adam. I have seen people report some good results using Adam, and finding some good training heuristics would be incredibly valuable.

score 12 · Answer 2 · edited Mar 22 '18 at 14:55

Adam uses the initial learning rate, or step size in the original paper's terminology, while adaptively computing updates. The step size also gives an approximate bound for updates. In this regard, I think it is a good idea to reduce the step size towards the end of training. This is also supported by a recent paper from NIPS 2017: The Marginal Value of Adaptive Gradient Methods in Machine Learning.

The last line in Section 4: Deep Learning Experiments says

Though conventional wisdom suggests that Adam does not require tuning, we find that tuning the initial learning rate and decay scheme for Adam yields significant improvements over its default settings in all cases.

Last but not least, the paper suggests that we use SGD anyways.
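
The remark above that the step size approximately bounds the updates can be checked with a scalar sketch of the Adam update rule. This is an illustration under assumptions, not a reference implementation: the function `adam_steps` is hypothetical, the hyperparameters are the commonly used defaults (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`), and only the simplest case of a constant gradient is shown. For noisy or sparse gradients the bound is only approximate.

```python
import math


def adam_steps(gradients, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the updates Adam would apply for a scalar gradient sequence."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(gradients, start=1):
        m = beta1 * m + (1 - beta1) * g      # first moment estimate
        v = beta2 * v + (1 - beta2) * g * g  # second moment estimate
        m_hat = m / (1 - beta1 ** t)         # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)         # bias-corrected second moment
        updates.append(-alpha * m_hat / (math.sqrt(v_hat) + eps))
    return updates


# With a constant gradient, the update magnitude stays close to alpha
# regardless of how large or small the gradient itself is.
for g in (0.01, 1.0, 100.0):
    print(g, [abs(u) for u in adam_steps([g] * 3)])
```

In this setting, lowering `alpha` late in training directly shrinks the size of each update, which is the motivation for decaying it.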

score 9 · Answer 3 · edited Apr 04 '18 at 14:07

The reason why most people don't use learning rate decay with Adam is that the algorithm itself does a learning rate decay in the following way:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

where t0 is the initial timestep, and lr_t is the new learning rate used.

answered May 26 '16 at 13:34 by Almanzt · edited by gung - Reinstate Monica

I'm not sure if this is the case. The factor `sqrt(1 - beta2^t) / (1 - beta1^t)` does not decay. It seems to compensate for the initialization of the first and second moment estimates. – Thijs Jul 15 '16 at 09:20
This answer is incorrect. That factor approaches 1.0 as t goes to infinity. Side note: learning_rate here is *fixed*. It's not the learning rate at time t-1. – rd11 Oct 03 '16 at 13:33
As the comments have noted, this answer is incorrect. It's unfortunate that it has been upvoted. – becko Jun 16 '20 at 16:34
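
The point made in these comments is easy to verify numerically. The sketch below evaluates the factor `sqrt(1 - beta2^t) / (1 - beta1^t)` at a few timesteps, assuming the commonly used defaults `beta1=0.9` and `beta2=0.999` (the thread does not state which values were meant); the function name is made up for illustration.

```python
import math


def bias_correction_factor(t, beta1=0.9, beta2=0.999):
    """sqrt(1 - beta2^t) / (1 - beta1^t) for timestep t >= 1."""
    return math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)


# beta1 and beta2 are assumed defaults, not values given in the thread.
for t in (1, 10, 100, 1000, 10000, 100000):
    print(t, bias_correction_factor(t))
```

The factor starts well below 1 and approaches 1 for large `t`, so it does not act as a decay of the learning rate over training.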

score 3 · Answer 4 · answered Nov 29 '17 at 03:28

I agree with @Indie AI's opinion. Here is some additional information:

From CS231n:

... Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. ...

And also from the paper Rethinking the Inception Architecture for Computer Vision, Section 8:

... while our best models were achieved using RMSProp [21] with decay of 0.9 and ε = 1.0. We used a learning rate of 0.045, decayed every two epoch using an exponential rate of 0.94. ...
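
One way to read the schedule in that quote is as a stepwise exponential decay: start at 0.045 and multiply by 0.94 once every two epochs. The sketch below assumes that reading (the paper excerpt does not spell out whether the decay is applied in steps or continuously), and `stepwise_decay` is a hypothetical helper name.

```python
def stepwise_decay(initial_rate, epoch, decay_rate=0.94, epochs_per_decay=2):
    """Learning rate for a 0-based epoch index, decayed once per block of epochs."""
    # Integer division keeps the rate constant within each two-epoch block.
    return initial_rate * decay_rate ** (epoch // epochs_per_decay)


# Values from the quote: 0.045 initial rate, 0.94 decay every two epochs.
for epoch in range(0, 10, 2):
    print(epoch, stepwise_decay(0.045, epoch))
```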

score 2 · Answer 5 · answered Mar 06 '18 at 22:19

I trained a model on some really easy data: whether a person is considered fat or not, given height and weight. I created the data by calculating BMI and labeling the person as fat if it is over 27. So very easy, basic data. When using Adam as the optimizer with a learning rate of 0.001, the accuracy only gets to around 85% after 5 epochs, topping out at 90% at most with over 100 epochs tested.

But when loading the model again at maybe 85% and using a learning rate of 0.0001, the accuracy goes to 95% over 3 epochs, and after 10 more epochs it's around 98-99%. Not sure if the learning rate can go below 4 digits (0.0001), but when loading the model again and using 0.00001, the accuracy hovers around 99.20-100% and won't go below. Again, not sure if the learning rate would be considered 0, but anyway, that's what I've got...

All of this was with categorical_crossentropy, but mean_square also gets to 99-100% with this method. Just for the record, AdaDelta, AdaGrad and Nesterov couldn't get above 65% accuracy.

99.20 - 100% seems kinda high. This was on a test set, right? (As opposed to measuring accuracy on the same data it was trained on) — Navin, Mar 20 '21 at 09:08
Yes, this was on a test set, where it would literally have all answers from BMI 0 to 50 or so — WoodyDRN, Mar 21 '21 at 14:48

score 1 · Answer 6 · answered Nov 03 '19 at 04:54

The learning rate decay in Adam is the same as that in RMSProp (as you can see from this answer), and it is mostly based on the magnitude of the previous gradients, in order to damp out the oscillations. So exponential decay (a learning rate that decreases over the course of training) can be adopted at the same time. Both decay the learning rate, but for different purposes.
