12

I am having problems understanding the skip-gram model of the Word2Vec algorithm.

In continuous bag-of-words it is easy to see how the context words can "fit" in the neural network, since you basically average them after multiplying each of the one-hot encoded representations by the input matrix W.

However, in the case of skip-gram, you only get the input word vector by multiplying the one-hot encoding by the input matrix, and then you are supposed to get C (= window size) vector representations for the context words by multiplying the input vector representation by the output matrix W'.

What I mean is: having a vocabulary of size $V$ and encodings of size $N$, with $W \in \mathbb{R}^{V\times N}$ as the input matrix and $W' \in \mathbb{R}^{N\times V}$ as the output matrix, given the word $w_i$ with one-hot encoding $x_i$ and context words $w_j$ and $w_h$ (with one-hot representations $x_j$ and $x_h$), if you multiply $x_i$ by the input matrix $W$ you get ${\bf h} := x_i^TW = W_{(i,\cdot)} \in \mathbb{R}^N$. Now, how do you generate $C$ score vectors from this?
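To make the setup concrete, here is a tiny numpy sketch of what I mean (the sizes and variable names are just for illustration):

```python
import numpy as np

V, N = 10, 4                      # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # input matrix  W  in R^{V x N}
W_out = rng.normal(size=(N, V))   # output matrix W' in R^{N x V}

i = 3                             # index of the centre word w_i
x_i = np.zeros(V)
x_i[i] = 1.0                      # one-hot encoding of w_i

h = x_i @ W                       # h = x_i^T W = W_(i,.), a single N-dimensional vector
u = h @ W_out                     # a single score vector of length V -- where do the C score vectors come from?
```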

Indie AI
crscardellino

3 Answers

7

I had the same problem understanding it. It seems that the output score vector will be the same for all C terms, but the error against each of the C one-hot target vectors will be different. These error vectors are then used in back-propagation to update the weights.
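A minimal numpy sketch of what I mean (the sizes and names are made up for illustration, not taken from the original sources): the probability vector is computed once, but each context word contributes its own error term, and the summed error drives the weight updates.

```python
import numpy as np

V, N = 10, 4                           # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))            # input matrix
W_out = rng.normal(size=(N, V))        # output matrix

center, context = 3, [1, 7]            # one centre word and C = 2 context words
h = W[center]                          # hidden layer: the row of W for the centre word
u = h @ W_out                          # ONE score vector, shared by all C positions
y = np.exp(u - u.max())
y /= y.sum()                           # softmax probabilities

# each context word has its own one-hot target, so each error vector differs
errors = [y - np.eye(V)[c] for c in context]
EI = np.sum(errors, axis=0)            # summed error vector used in back-propagation
grad_W_out = np.outer(h, EI)           # gradient with respect to the output matrix
```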

Please correct me, if I'm wrong.

Source: https://iksinc.wordpress.com/tag/skip-gram-model/

chmodsss
0

In both models the output score depends on the scoring function that you use; there are two choices, softmax or negative sampling. Suppose you use the softmax scoring function. You will get a score matrix of size N*D, where D is the dimension of a word vector and N is the number of examples. Each word is like a class in the neural network architecture.
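As a rough illustration of that last sentence (a toy numpy sketch with made-up sizes, not the actual training code), the softmax treats every vocabulary word as one class:

```python
import numpy as np

V, D = 10, 4                        # toy vocabulary size and word-vector dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))      # input word vectors
W_out = rng.normal(size=(D, V))     # output weights

h = W_in[3]                         # representation of one input word
logits = h @ W_out                  # one logit per vocabulary word, i.e. per "class"
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # softmax over all V classes
```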

0

In the skip-gram model a one-hot encoded word is fed to a shallow two-layer neural network. Since the input is one-hot encoded, the hidden layer is simply one row of the input-to-hidden weight matrix (say the $k$-th row, because the $k$-th entry of the input vector is one).

The score for each word is computed by the following equation:

$u = \mathcal{W'}^Th$

where $h$ is the hidden-layer vector and $\mathcal{W'}$ is the hidden-to-output weight matrix. After computing $u$, $\mathcal{C}$ multinomial distributions are computed, where $\mathcal{C}$ is the window size. The distributions are computed by the following equation:

$p(w_{c,j} = w_{O,c}|w_I)=\frac{\exp{u_{c,j}}}{\sum_{j'=1}^V\exp{u_{j'}}}$

As you can see, all of the $\mathcal{C}$ distributions are different (for more information see https://arxiv.org/pdf/1411.2738.pdf). In fact, this would be clearer with a figure that draws the $\mathcal{C}$ output layers as separate panels.

In summary, there is only one score vector $u$. However, $\mathcal{C}$ different distributions are computed from it using the softmax function.
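A minimal numpy sketch of the two equations above (the sizes are made up for illustration):

```python
import numpy as np

V, N = 10, 4                          # toy vocabulary size and hidden size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))           # input-to-hidden weights
W_prime = rng.normal(size=(N, V))     # hidden-to-output weights W'

k = 3                                 # index of the one-hot input word
h = W[k]                              # hidden layer: the k-th row of W
u = W_prime.T @ h                     # u = W'^T h, one score per vocabulary word

p = np.exp(u - u.max())
p /= p.sum()                          # softmax: the multinomial distribution for an output position
```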

$\textbf{References:}$

  • Xin Rong, word2vec Parameter Learning Explained, arXiv:1411.2738
Janothan
user3108764
  • So the W matrix is essentially the word vectors (the output of the algorithm), and W' is a totally different matrix that we throw away? – Nathan B Nov 06 '17 at 15:07
  • $W'$ also contains word vectors, which are equally good. – user3108764 Nov 07 '17 at 15:41
  • 1
    This is wrong. See equation (26) from Xin Rong, Word2Vec Parameter Learning Explained. In fact $p(w_{c,j} = w_{O,c}|w_I)=\frac{\exp{u_{c,j}}}{\sum_{j'=1}^V\exp{u_{j'}}}=\frac{\exp{u_{j}}}{\sum_{j'=1}^V\exp{u_{j'}}}$. The output score vector will be the same for all C terms. – siulkilulki Jan 15 '19 at 19:00