I've written a simple MLP in TensorFlow that models an XOR gate.
So for:
input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
it should produce the following:
output_data = [[0.], [1.], [1.], [0.]]
The network has an input layer, a hidden layer and an output layer with 2, 5 and 1 neurons respectively.
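For reference, this is roughly the graph I'm building (a minimal TF 1.x-style sketch; the exact initialisation and the sigmoid activations are assumptions on my part):

    import tensorflow as tf

    # Inputs and targets for the XOR problem
    n_input = tf.placeholder(tf.float32, shape=[None, 2], name="n_input")
    n_output = tf.placeholder(tf.float32, shape=[None, 1], name="n_output")

    # Hidden layer: 2 -> 5 neurons
    w_hidden = tf.Variable(tf.random_uniform([2, 5], -1.0, 1.0))
    b_hidden = tf.Variable(tf.zeros([5]))
    hidden = tf.sigmoid(tf.matmul(n_input, w_hidden) + b_hidden)

    # Output layer: 5 -> 1 neuron
    w_out = tf.Variable(tf.random_uniform([5, 1], -1.0, 1.0))
    b_out = tf.Variable(tf.zeros([1]))
    output = tf.sigmoid(tf.matmul(hidden, w_out) + b_out)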
Currently I have the following cross entropy:
cross_entropy = -(n_output * tf.log(output) + (1 - n_output) * tf.log(1 - output))
I've also tried this simpler alternative (which is really a squared error rather than a cross entropy):
cross_entropy = tf.square(n_output - output)
along with a few other variants.
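In both cases I reduce the per-example losses to a scalar before passing them to the optimizer, roughly like this (the clipping is an assumption I added to guard against log(0)):

    # Scalar cross-entropy loss, with clipping for numerical stability
    eps = 1e-7
    clipped = tf.clip_by_value(output, eps, 1.0 - eps)
    cross_entropy = tf.reduce_mean(
        -(n_output * tf.log(clipped) + (1 - n_output) * tf.log(1 - clipped)))

    # Squared-error alternative
    squared_error = tf.reduce_mean(tf.square(n_output - output))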
However, no matter what my setup was, the error decreased much more slowly with a GradientDescentOptimizer than with an AdamOptimizer.
In fact, tf.train.AdamOptimizer(0.01) produced really good results after 400-800 learning steps (depending on the learning rate, with 0.01 giving the best results), while tf.train.GradientDescentOptimizer always needed over 2000 learning steps, no matter which cross-entropy calculation or learning rate was used.
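My training loop for comparing the two looks roughly like this (a sketch assuming the graph and losses above; only the train_op line changes between runs):

    loss = cross_entropy  # or squared_error

    # train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
    train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

    input_data = [[0., 0.], [0., 1.], [1., 0.], [1., 1.]]
    output_data = [[0.], [1.], [1.], [0.]]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(2000):
            _, err = sess.run(
                [train_op, loss],
                feed_dict={n_input: input_data, n_output: output_data})
            if step % 100 == 0:
                print(step, err)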
Why is this? Is the AdamOptimizer simply always the better choice?