
I was implementing word2vec in TensorFlow and found that Gradient Descent worked much better and faster than the AdamOptimizer. I was under the impression that Adam was the "smarter" option that almost always does better than GD. I used several starting learning rates for Adam, from 1.0 to 0.01, but none did nearly as well as GD with a learning rate of 1.0. Am I missing something about these optimizers or their application to word2vec in particular?

Code:

import tensorflow as tf

# BATCH_SIZE, VOCAB_SIZE, EMBED_SIZE, and NUM_SAMPLED are hyperparameters defined elsewhere in the script

# Define the placeholders for input and output
center_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE], name='center_words')
target_words = tf.placeholder(tf.int32, shape=[BATCH_SIZE, 1], name='target_words')

# Define weights. In word2vec, it's actually the weights that we care about
embed_matrix = tf.Variable(tf.random_uniform([VOCAB_SIZE, EMBED_SIZE], -1.0, 1.0), 
                        name='embed_matrix')

# Define the inference
embed = tf.nn.embedding_lookup(embed_matrix, center_words, name='embed')

# Construct variables for NCE loss
nce_weight = tf.Variable(tf.truncated_normal([VOCAB_SIZE, EMBED_SIZE],
                                            stddev=1.0 / (EMBED_SIZE ** 0.5)), 
                                            name='nce_weight')
nce_bias = tf.Variable(tf.zeros([VOCAB_SIZE]), name='nce_bias')

# Define loss function to be NCE loss function
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                    biases=nce_bias, 
                                    labels=target_words, 
                                    inputs=embed, 
                                    num_sampled=NUM_SAMPLED, 
                                    num_classes=VOCAB_SIZE), name='loss')

# Define optimizer
global_step = tf.Variable(0, name='global_step', trainable=False)
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss, global_step=global_step)
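
For the Adam runs, only the optimizer line changed, roughly like this (0.01 was one of the learning rates tried, 1.0 another):

# Same graph as above; only the optimizer is swapped out
optimizer = tf.train.AdamOptimizer(0.01).minimize(loss, global_step=global_step)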
Puzzler3141
  • Just a guess -- Vanilla Adam updates all parameters at every step, while lazy Adam only updates parameters that are actually employed -- in a sparse setting like a language model, that can make a big difference, because lazy Adam applies no updates to rare words until they appear, at which time they get a *big* update. More common words are updated more frequently. – Sycorax Jul 04 '17 at 04:35
  • @Sycorax So you're saying that LazyAdam would do better since it doesn't needlessly update parameters it shouldn't be messing with? I'm having a little trouble understanding why the sparsity would affect it. – Puzzler3141 Jul 04 '17 at 04:39
  • To future reviewers: I'm not sure why this question was closed. It's asking about optimization, and my understanding is that optimization is on-topic here. – Sycorax Jul 04 '17 at 15:02
  • Possibly answered here: https://stats.stackexchange.com/questions/313278/no-change-in-accuracy-using-adam-optimizer-when-sgd-works-fine – Sycorax Sep 20 '18 at 14:56
  • It seems that Adam is not well suited to word-embedding models in general; Adagrad tends to work better (see the sketch after these comments). https://hackernoon.com/various-optimisation-techniques-and-their-impact-on-generation-of-word-embeddings-3480bd7ed54f http://ruder.io/optimizing-gradient-descent/index.html#adagrad – SantoshGupta7 Aug 01 '19 at 04:34
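
A minimal sketch of the sparse-friendly alternatives the comments suggest, assuming TensorFlow 1.x (where tf.contrib.opt.LazyAdamOptimizer is available); the learning rates are illustrative, not tuned:

# LazyAdam only updates the Adam moment slots for the embedding rows that appear in the batch
optimizer = tf.contrib.opt.LazyAdamOptimizer(0.01).minimize(loss, global_step=global_step)

# Adagrad adapts per-parameter step sizes and also handles sparse gradients well
# optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss, global_step=global_step)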

0 Answers