While browsing around regarding CBOW issues I stumbled upon this question, so here is an alternative answer to your (first) question ("What is a projection layer vs. matrix?"), looking at the NNLM model (Bengio et al., 2003):

Comparing this to Mikolov's model[s] (shown in an alternative answer to this question), the sentence cited in the question means that Mikolov removed the (non-linear!) $\tanh$ layer seen in Bengio's model above. And Mikolov's first (and only) hidden layer, instead of keeping an individual vector $C(w_i)$ for each word, uses a single vector into which the "word parameters" are summed; those sums are then averaged. This explains the last question ("What does it mean that the vectors are averaged?"). The words are "projected into the same position" because the weights assigned to the individual input words are summed and averaged in Mikolov's model. His projection layer therefore loses all positional information, unlike Bengio's first hidden layer (a.k.a. the projection matrix), which answers the second question ("What does it mean that all words get projected into the same position?"). In short, Mikolov's model[s] retained the "word parameters" (the input weight matrix), removed the projection matrix $C$ and the $\tanh$ layer, and replaced both with a "simple" projection layer.
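To make the difference concrete, here is a minimal NumPy sketch (my own toy illustration with made-up sizes and names, not Mikolov's or Bengio's actual code) contrasting the position-aware projection matrix with the order-insensitive CBOW projection layer:

```python
import numpy as np

# Minimal sketch (toy sizes, not the real implementation): contrast Bengio's
# position-aware projection *matrix* with Mikolov's CBOW projection *layer*.
V, d = 10, 4                       # vocabulary size, embedding dimension (made up)
rng = np.random.default_rng(0)
C = rng.normal(size=(V, d))        # shared "word parameters" (input weight matrix)

context = [3, 7, 1, 5]             # indices of the surrounding words

# Bengio (2003): concatenate C(w_i) per position, so word order still matters.
bengio_projection = np.concatenate([C[i] for i in context])    # shape (4*d,)

# Mikolov's CBOW: sum and average the same vectors, so any permutation of the
# context gives the identical result ("projected into the same position").
cbow_projection = np.mean([C[i] for i in context], axis=0)     # shape (d,)

shuffled = np.mean([C[i] for i in [5, 1, 7, 3]], axis=0)
assert np.allclose(cbow_projection, shuffled)   # positional information is lost
```

Because the average is permutation-invariant, every ordering of the context words produces the same hidden representation, which is exactly the loss of positional information described above.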
To add, and "just for the record": The real exciting part is Mikolov's approach to solving the part where in Bengio's image you see the phrase "most computation here". Bengio tried to lessen that problem by doing something that is called hierarchical softmax (instead of just using the softmax) in a later paper (Morin & Bengio 2005). But Mikolov, with his strategy of negative subsampling took this a step further: He doesn't compute the negative log-likelihood of all "wrong" words (or Huffman codings, as Bengio suggested in 2005) at all, and just computes a very small sample of negative cases, which, given enough such computations and a clever probability distribution, works extremely well. And the second and even more major contribution, naturally, is that the whole thing about his additive "compositionality" ("man + king = woman + ?" with answer queen), which only really works well with his Skip-Gram model, and can be roughly understood as taking Bengio's model, applying the changes Mikolov suggested (i.e., the phrase cited in your question), and then inverting the whole process. That is, guessing the surrounding words from the output words (now used as input), $P(context | w_t = i)$, instead.