
AdaGrad is an enhancement of SGD that automatically adapts a per-parameter learning rate.

However, in word2vec there is no obvious single "parameter" to apply AdaGrad to. So what is the closest algorithm to AdaGrad for word2vec?
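For concreteness, here is a minimal numpy sketch (not taken from any word2vec code) of the per-parameter update I mean; the step size `eta`, the stabilizer `eps`, and the function name are illustrative choices:

```python
import numpy as np

def adagrad_step(theta, grad, G, eta=0.05, eps=1e-8):
    """One element-wise AdaGrad step for minimizing a loss: each parameter
    theta[i] gets its own effective learning rate eta / sqrt(G[i] + eps)."""
    G = G + grad ** 2                         # per-parameter sum of squared gradients
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G
```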

Glen_b

1 Answer


The officially provided implementation of word2vec only lets you set the learning rate.

AdaGrad maintains a variable $G$ that simply accumulates the squared norms of the gradients seen so far; for example, if you are trying to maximize $\log p_w$:

\begin{eqnarray*}
G &\leftarrow& G + \lVert \nabla_\theta \log p_w \rVert^2 \\
\theta &\leftarrow& \theta + \frac{\eta}{\sqrt{G}} \nabla_\theta \log p_w
\end{eqnarray*}
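In numpy this pair of updates is just a few lines; the `eps` term below is an extra numerical-stability assumption on my part, not part of the formulas above:

```python
import numpy as np

def scalar_adagrad_ascent(theta, grad, G, eta=0.05, eps=1e-8):
    """One step of the update above: G accumulates the squared *norm* of the
    gradient, and theta moves in the ascent direction of log p_w."""
    G = G + grad @ grad                            # G <- G + ||grad||^2
    theta = theta + eta / np.sqrt(G + eps) * grad  # ascent step
    return theta, G
```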

In word2vec's CBOW and skip-gram neural architectures, the parameters are the input and output word vectors, so AdaGrad can be applied to them per element just like to any other parameters.
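So one natural way to get "AdaGrad for word2vec" is simply to keep one accumulator per element of those vectors and update only the rows touched by each training pair. A rough sketch under that assumption (the gradient computation itself is left abstract, and all names here are illustrative, not from the reference implementation):

```python
import numpy as np

vocab_size, dim = 10_000, 100
W_in = (np.random.rand(vocab_size, dim) - 0.5) / dim   # input vectors
W_out = np.zeros((vocab_size, dim))                    # output vectors
G_in = np.zeros_like(W_in)                             # AdaGrad accumulators
G_out = np.zeros_like(W_out)
eta, eps = 0.05, 1e-8

def adagrad_row_update(W, G, row, grad):
    """AdaGrad step applied only to the row that received a gradient."""
    G[row] += grad ** 2
    W[row] -= eta / np.sqrt(G[row] + eps) * grad

# Inside the training loop, after computing the gradients for one
# (center, context) pair, e.g. with negative sampling:
#   adagrad_row_update(W_in, G_in, center_id, grad_center)
#   adagrad_row_update(W_out, G_out, context_id, grad_context)
```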

Franck Dernoncourt