
Typically, the Skip-Gram (SG) architecture is argued to be superior to Continuous Bag-of-Words (CBOW) for rare words because CBOW averages over the input words (Mikolov et al., 2013). That is, CBOW's hidden layer is defined as: $$\begin{equation} {\bf h} = {\bf W}^T \sum^C_{i=1}({\bf x}_i) C^{-1} = {\bf W}^T {\bf \bar x} \end{equation}$$ where $\bf h$ is the hidden layer vector of size $H$, $\bf W$ is the input-to-hidden weight matrix, and ${\bf x}_i$ is the input vector [1] for the $i$-th of the $C$ context words surrounding our target word, giving rise to the "average context vector" ${\bf \bar x}$. With this (let me call it the) "average" CBOW architecture, the weight matrix is of size $$V \times H$$ where $V$ is the size of the input vocabulary and $H$ is the number of hidden units (i.e., the length of the hidden layer vector $\bf h$) [3].
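To make equation (1) concrete, here is a minimal NumPy sketch of the "average" CBOW hidden layer (the sizes, the random $\bf W$, and the word indices are purely illustrative assumptions on my part):

```python
import numpy as np

# Toy sizes only; a real vocabulary is far larger.
V, H, C = 10, 4, 3                     # vocabulary size, hidden units, context words
rng = np.random.default_rng(0)
W = rng.normal(size=(V, H))            # input-to-hidden weight matrix (V x H)

context_ids = [2, 5, 7]                # arbitrary word indices for the C context words
X = np.zeros((C, V))
X[np.arange(C), context_ids] = 1.0     # one one-hot vector per context word

x_bar = X.mean(axis=0)                 # the "average context vector"
h = W.T @ x_bar                        # hidden layer, equation (1); shape (H,)

# Because the inputs are one-hot, h is simply the mean of the C embedding rows of W:
assert np.allclose(h, W[context_ids].mean(axis=0))
```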

Now, instead of computing this semantically rather ambiguous "average context vector" ${\bf \bar x} = \sum^C_{i=1}({\bf x}_i) C^{-1}$, it seems much more "semantically sound" to concatenate the context words into one input vector ${\bf x} = {\bf x}_1 \Vert \cdots \Vert {\bf x}_C$. After all, this "average" of several words seems devoid of any clear meaning: take any closely neighboring, non-equal words in this post, like [take, any, words], and you get $\frac{"take" + "any" + "words"}{3}$ - seriously!?
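To make that complaint concrete, here is a tiny sketch (with a made-up three-word vocabulary of my own choosing): the averaged input is just a uniform bag-of-words vector, and it is identical for any ordering of the context words.

```python
import numpy as np

# Made-up toy vocabulary, purely for illustration.
vocab = {"take": 0, "any": 1, "words": 2}
V = len(vocab)

def avg_context(words):
    """Average of the one-hot vectors of the given context words."""
    X = np.zeros((len(words), V))
    for i, w in enumerate(words):
        X[i, vocab[w]] = 1.0
    return X.mean(axis=0)

# The averaged input is permutation-invariant: word order is lost entirely.
print(avg_context(["take", "any", "words"]))   # [0.33333333 0.33333333 0.33333333]
print(avg_context(["words", "any", "take"]))   # [0.33333333 0.33333333 0.33333333]
```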

With this, instead of the first equation, it is now possible to define the hidden layer as: $$\begin{equation} {\bf h} = {\bf W}^T ({\bf x}_1 \Vert \cdots \Vert {\bf x}_C) = {\bf W}^T {\bf x} \end{equation}$$ where $\Vert$ denotes vector concatenation and the size of the 2-dimensional weight matrix $\bf W$ changes to $$CV \times H$$ While that change implies a larger input-to-hidden weight matrix (by a factor of $C$), the "data size" of the input layer stays the same: instead of a $C \times V$ matrix, it is now a single vector of length $CV$. However, the "blur" from averaging the context goes away. In addition, $C \ll V$, with $C$ typically being ten words or less (Mikolov et al. suggest four+four for CBOW), and $V$ possibly in the tens or hundreds of thousands. If we look at the computational costs, the argument for CBOW is that its optimal computational complexity $Q$ is defined as:

$$ Q = C \times H + H \times V $$

With the "concatenated" CBOW, this complexity does not change, because we still are multiplying $C$ input vectors to get the hidden states $H$ (and the complexity is dominated by the second operation, the hidden-to-output transformation, with $H \times V$). Moreover, CBOW has a lower computational complexity than SG, which for the SG architecture is, from Mikolov et al. (as extracted and "translated" to my variable names here [3]):

$$ Q = \frac{C}{2} \times (H + H \times V) $$
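To put rough numbers on the comparison (the values below are my own illustrative choices, not taken from the paper, and use the formulas exactly as written above, i.e., with a full output layer): with $C = 8$, $H = 500$ and $V = 100{,}000$, $$ Q_{\text{CBOW}} = 8 \times 500 + 500 \times 100{,}000 \approx 5.0 \times 10^7 \qquad Q_{\text{SG}} = \frac{8}{2} \times (500 + 500 \times 100{,}000) \approx 2.0 \times 10^8 $$ Both estimates are dominated by the $H \times V$ hidden-to-output term, so SG comes out roughly $C/2$ times more expensive than either CBOW variant.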

I.e., the only cost I can spot with a "concatenated" CBOW architecture is the size increase of the first weight matrix (by a factor of $C$).

Such a "concatenated" CBOW architecture provides for position specific weights (i.e., syntactic information) for the context words, something that neither the "average" CBOW or the SG architecture does. In fact, I would even assume that you actually loose much valuable information with "averaged" CBOW and even SG [2]. Furthermore, CBOW seems generally preferably over SG (were it not for this issue about rare words), because CBOW can predict an embedding for OOV words. And, last but not least, the CBOW is computationally more favorable than the SG architecture.

So the seemingly trivial question I cannot find an existing answer to is: Why not treat the context as one concatenated input vector, as outlined above, instead of averaging over the context word vectors, and therefore train position-specific weights free of the averaging "blur"? (Other than the obvious increase in the size of the weight matrix, naturally.) Framing my question a bit more radically: Isn't the "averaged" CBOW architecture (as described in Mikolov et al., 2013, in Section 3.1 and Figure 1, left) necessarily inferior to their newly invented SG architecture, and wouldn't a "concatenated" CBOW (as described here) most likely outperform the SG architecture even for rare words?

EDIT/ADDENDUM: I should clarify that by "performance" I am mostly referring to the final results, that is, Table 6 in the Mikolov et al. paper, where CBOW is only slightly behind SG, but faster to train.

[1] For the sake of simplicity, we can assume the context (word) vectors are one-hot encoded, but it could just as well be a hierarchical softmax.

[2] Although, citing from Mikolov et al.'s paper, they do use "some" positional information:

we give less weight to the distant words by sampling less from those words in our training examples

(from Section 3.2 of the paper)

[3] In Mikolov et al.'s paper, $V$ is the same as $V$ here (vocabulary size), but they use $D$ for my $H$ (size of the hidden layer) and $N$ for what I call $C$ (number of context words). Finally, their $C$ corresponds roughly to their $N$, or more precisely, $C = N / 2$.

  • After some more thinking, I guess the real answer here isn't in the rare words, but rather in the fact that you don't get what Mikolov et al. call "additive compositionality" (Section 5 of their NIPS paper) with a CBOW model. So while the CBOW model "as defined" is maybe sub-par, even a "concatenated" CBOW model will still be less interesting than a SG model. Whatever, I'll leave the question around in case someone has something interesting to say here. – fnl Dec 30 '16 at 18:13
