12

I am having problems understanding the skip-gram model of the Word2Vec algorithm.

In continuous bag-of-words it is easy to see how the context words can "fit" in the neural network, since you basically average them after multiplying each of the one-hot encoded representations by the input matrix W.

However, in the case of skip-gram, you only get the input word vector by multiplying the one-hot encoding by the input matrix, and then you are supposed to get C (= window size) vector representations for the context words by multiplying the input vector representation by the output matrix W'.

What I mean is: having a vocabulary of size $V$ and encodings of size $N$, with $W \in \mathbb{R}^{V\times N}$ as the input matrix and $W' \in \mathbb{R}^{N\times V}$ as the output matrix, given the word $w_i$ with one-hot encoding $x_i$ and context words $w_j$ and $w_h$ (with one-hot representations $x_j$ and $x_h$), if you multiply $x_i$ by the input matrix $W$ you get ${\bf h} := x_i^TW = W_{(i,\cdot)} \in \mathbb{R}^N$. Now, how do you generate $C$ score vectors from this?
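To make the setup concrete, here is a tiny numpy sketch of what I mean (the sizes and variable names are just for illustration):

```python
import numpy as np

V, N = 10, 4                      # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # input matrix  W  in R^{V x N}
W_out = rng.normal(size=(N, V))   # output matrix W' in R^{N x V}

i = 3                             # index of the centre word w_i
x_i = np.zeros(V)
x_i[i] = 1.0                      # one-hot encoding of w_i

h = x_i @ W                       # h = x_i^T W = W_(i,.), a single N-dimensional vector
u = h @ W_out                     # a single score vector of length V -- where do the C score vectors come from?
```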

Indie AI
crscardellino

3 Answers

7

I had the same problem understanding it. It seems that the output score vector will be the same for all C terms, but the error against each of the C one-hot target vectors will be different. These error vectors are then used in back-propagation to update the weights.
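A minimal numpy sketch of what I mean (the sizes and names are made up for illustration, not taken from the original sources): the probability vector is computed once, but each context word contributes its own error term, and the summed error drives the weight updates.

```python
import numpy as np

V, N = 10, 4                           # toy vocabulary size and embedding size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))            # input matrix
W_out = rng.normal(size=(N, V))        # output matrix

center, context = 3, [1, 7]            # one centre word and C = 2 context words
h = W[center]                          # hidden layer: the row of W for the centre word
u = h @ W_out                          # ONE score vector, shared by all C positions
y = np.exp(u - u.max())
y /= y.sum()                           # softmax probabilities

# each context word has its own one-hot target, so each error vector differs
errors = [y - np.eye(V)[c] for c in context]
EI = np.sum(errors, axis=0)            # summed error vector used in back-propagation
grad_W_out = np.outer(h, EI)           # gradient with respect to the output matrix
```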

Please correct me, if I'm wrong.

Source: https://iksinc.wordpress.com/tag/skip-gram-model/

chmodsss
0

In both models the output score depends on the scoring function that you use; there are two choices, softmax or negative sampling. Suppose you use the softmax scoring function. You will get a score matrix of size N*D, where D is the dimension of a word vector and N is the number of examples. Each word is like a class in the neural network architecture.
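As a rough illustration of that last sentence (a toy numpy sketch with made-up sizes, not the actual training code), the softmax treats every vocabulary word as one class:

```python
import numpy as np

V, D = 10, 4                        # toy vocabulary size and word-vector dimension
rng = np.random.default_rng(0)
W_in = rng.normal(size=(V, D))      # input word vectors
W_out = rng.normal(size=(D, V))     # output weights

h = W_in[3]                         # representation of one input word
logits = h @ W_out                  # one logit per vocabulary word, i.e. per "class"
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # softmax over all V classes
```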

0

In the skip-gram model a one-hot encoded word is fed to a shallow two-layer neural network. Since the input is one-hot encoded, the hidden layer is simply one row of the input-to-hidden weight matrix (say the $k$-th row, because the $k$-th entry of the input vector is one).

The score for each word is computed by the following equation:

$u = \mathcal{W'}^Th$

where $h$ is the hidden-layer vector and $\mathcal{W'}$ is the hidden-to-output weight matrix. After computing $u$, $\mathcal{C}$ multinomial distributions are computed, where $\mathcal{C}$ is the window size. The distributions are computed by the following equation:

$p(w_{c,j} = w_{O,c}|w_I)=\frac{\exp{u_{c,j}}}{\sum_{j'=1}^V\exp{u_{j'}}}$

As you can see, all of the $\mathcal{C}$ distributions are different (for more information see https://arxiv.org/pdf/1411.2738.pdf). In fact, this would be clearer with a figure that draws the $\mathcal{C}$ output layers as separate panels.

In summary, there is only one score vector $u$. However, $\mathcal{C}$ different distributions are computed from it using the softmax function.
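A minimal numpy sketch of the two equations above (the sizes are made up for illustration):

```python
import numpy as np

V, N = 10, 4                          # toy vocabulary size and hidden size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))           # input-to-hidden weights
W_prime = rng.normal(size=(N, V))     # hidden-to-output weights W'

k = 3                                 # index of the one-hot input word
h = W[k]                              # hidden layer: the k-th row of W
u = W_prime.T @ h                     # u = W'^T h, one score per vocabulary word

p = np.exp(u - u.max())
p /= p.sum()                          # softmax: the multinomial distribution for an output position
```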

$\textbf{References:}$

  • Xin Rong, word2vec Parameter Learning Explained, arXiv:1411.2738
Janothan
user3108764
  • So the W matrix is essentially the word vectors (the output of the algorithm), and W' is a totally different matrix that we throw away? – Nathan B Nov 06 '17 at 15:07
  • $W'$ also contains word vectors, which are equally good. – user3108764 Nov 07 '17 at 15:41
  • 1
    This is wrong. See equation (26) from Xin Rong, Word2Vec Parameter Learning Explained. In fact $p(w_{c,j} = w_{O,c}|w_I)=\frac{\exp{u_{c,j}}}{\sum_{j'=1}^V\exp{u_{j'}}}=\frac{\exp{u_{j}}}{\sum_{j'=1}^V\exp{u_{j'}}}$. The output score vector will be the same for all C terms. – siulkilulki Jan 15 '19 at 19:00