
AdaGrad is an enhancement of SGD that automatically adapts a per-parameter learning rate.

However, in word2vec there is no obvious single "parameter" to apply AdaGrad to. So what is the closest algorithm to AdaGrad for word2vec?
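For concreteness, here is a minimal numpy sketch (not taken from any word2vec code) of the per-parameter update I mean; the step size `eta`, the stabilizer `eps`, and the function name are illustrative choices:

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.05, eps=1e-8):
    """One element-wise AdaGrad step for minimizing a loss: each parameter
    theta[i] gets its own effective learning rate eta / sqrt(G[i] + eps)."""
    G = G + grad ** 2                         # per-parameter sum of squared gradients
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G
```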

Glen_b

1 Answer


The officially provided implementation of word2vec only lets you set the learning rate.

AdaGrad maintains a variable $G$ that simply accumulates the squared norms of the gradients seen so far; for example, if you are trying to maximize $\log p_w$:

\begin{eqnarray*}
G &\leftarrow& G + \lVert \nabla_\theta \log p_w \rVert^2 \\
\theta &\leftarrow& \theta + \frac{\eta}{\sqrt{G}} \nabla_\theta \log p_w
\end{eqnarray*}
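In numpy this pair of updates is just a few lines; the `eps` term below is an extra numerical-stability assumption on my part, not part of the formulas above:

```python
import numpy as np

def scalar_adagrad_ascent(theta, grad, G, eta=0.05, eps=1e-8):
    """One step of the update above: G accumulates the squared *norm* of the
    gradient, and theta moves in the ascent direction of log p_w."""
    G = G + grad @ grad                            # G <- G + ||grad||^2
    theta = theta + eta / np.sqrt(G + eps) * grad  # ascent step
    return theta, G
```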

In word2vec's CBOW and skip-gram neural architectures, the parameters are the input and output word vectors, so AdaGrad can be applied to them per element just like to any other parameters.
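So one natural way to get "AdaGrad for word2vec" is simply to keep one accumulator per element of those vectors and update only the rows touched by each training pair. A rough sketch under that assumption (the gradient computation itself is left abstract, and all names here are illustrative, not from the reference implementation):

```python
import numpy as np

vocab_size, dim = 10_000, 100
W_in = (np.random.rand(vocab_size, dim) - 0.5) / dim   # input vectors
W_out = np.zeros((vocab_size, dim))                    # output vectors
G_in = np.zeros_like(W_in)                             # AdaGrad accumulators
G_out = np.zeros_like(W_out)
eta, eps = 0.05, 1e-8

def adagrad_row_update(W, G, row, grad):
    """AdaGrad step applied only to the row that received a gradient."""
    G[row] += grad ** 2
    W[row] -= eta / np.sqrt(G[row] + eps) * grad

# Inside the training loop, after computing the gradients for one
# (center, context) pair, e.g. with negative sampling:
#   adagrad_row_update(W_in, G_in, center_id, grad_center)
#   adagrad_row_update(W_out, G_out, context_id, grad_context)
```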

Franck Dernoncourt