I wonder why hierarchical softmax is better for infrequent words, while negative sampling is better for frequent words, in word2vec's CBOW and skip-gram models. I have read the claim on https://code.google.com/p/word2vec/.
3 Answers
I'm not an expert in word2vec, but upon reading Rong, X. (2014). word2vec Parameter Learning Explained and from my own NN experience, I'd simplify the reasoning to this:
- Hierarchical softmax improves training efficiency because the output word is reached by a tree-like traversal instead of a full softmax over the vocabulary; a given training sample only has to evaluate/update $O(\log N)$ network units rather than $O(N)$. This lets the output weights scale to a large vocabulary - a given word is tied to fewer output units, and vice versa.
- Negative sampling is a way to subsample the training signal, similar in spirit to stochastic gradient descent, but the key is that you also draw negative training examples. Intuitively, it trains on places where a word might have been expected but was not found, which is much cheaper than updating the whole output layer at every step and makes sense for common words (a per-example cost comparison is sketched after this answer).
The two methods don't seem to be mutually exclusive in theory, but in any case that seems to be why one suits frequent words and the other infrequent ones.
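To make the efficiency argument concrete, here is a minimal sketch (not the word2vec source; the vocabulary size `N` and negative-sample count `k` are assumed values) comparing how many output-side vectors a single (context, target) training pair touches under each scheme:

```python
# Rough per-example update counts; N and k are illustrative assumptions.
import math

N = 1_000_000   # assumed vocabulary size
k = 5           # assumed number of negative samples per positive example

full_softmax_updates = N                        # every output vector gets a gradient
hierarchical_updates = math.ceil(math.log2(N))  # ~ length of a Huffman path; frequent
                                                #   words have even shorter paths
negative_sampling_updates = 1 + k               # the observed word plus k sampled negatives

print(f"full softmax:         {full_softmax_updates:>9} vectors per example")
print(f"hierarchical softmax: {hierarchical_updates:>9} vectors per example")
print(f"negative sampling:    {negative_sampling_updates:>9} vectors per example")
```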

My understanding is that this comes down to the Huffman coding used when building the category hierarchy.
Hierarchical softmax uses a tree of sigmoid nodes instead of one big softmax, and Huffman coding ensures that the number of data points falling on each side of any sigmoid node is balanced. This helps eliminate the preference towards frequent categories that arises with one big softmax or with negative sampling.
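To illustrate the Huffman point, here is a minimal sketch (illustrative only; the word counts are hypothetical, and this is not word2vec's implementation): frequent words get short codes, i.e. short paths from the root of the hierarchical-softmax tree, while rare words sit deeper.

```python
# Build a Huffman tree from word frequencies and report each word's code length.
import heapq
from itertools import count

def huffman_code_lengths(freqs):
    """Return {word: code_length} for a Huffman tree built from word frequencies."""
    tiebreak = count()  # keeps heap comparisons well-defined when frequencies tie
    heap = [(f, next(tiebreak), {w: 0}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {w: depth + 1 for w, depth in {**left, **right}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical word counts, just to show the effect.
counts = {"the": 50000, "of": 30000, "cat": 500, "axolotl": 3}
print(huffman_code_lengths(counts))
# {'the': 1, 'of': 2, 'cat': 3, 'axolotl': 3} -- frequent words sit near the root
```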

Hierarchical softmax builds a tree over the whole vocabulary, and the leaf nodes representing rare words inevitably inherit the vector representations of their ancestors in the tree, which are also influenced by the frequent words in the corpus. This can also benefit incremental training on a new corpus.
Negative sampling is based on noise contrastive estimation: it randomly samples words that are not in the context, so the model learns to distinguish the observed data from artificially generated random noise.
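For illustration, here is a minimal sketch of the negative-sampling loss for a single (center, context) pair, assuming toy random embeddings; the vocabulary, counts, and function names are made up for the example, and the 0.75-smoothed unigram noise distribution follows the choice made in the word2vec paper.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "of", "cat", "sat", "mat", "axolotl"]
counts = np.array([50000, 30000, 500, 450, 400, 3], dtype=float)
noise_dist = counts ** 0.75          # smoothed unigram noise distribution
noise_dist /= noise_dist.sum()

dim, k = 8, 5
in_vecs = rng.normal(scale=0.1, size=(len(vocab), dim))   # "input" (center) embeddings
out_vecs = rng.normal(scale=0.1, size=(len(vocab), dim))  # "output" (context) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context):
    """Binary logistic loss: pull the observed pair together, push k noise words apart."""
    t, c = vocab.index(center), vocab.index(context)
    negatives = []
    while len(negatives) < k:
        cand = rng.choice(len(vocab), p=noise_dist)
        if cand != c:                # skip the observed context word
            negatives.append(cand)
    negatives = np.array(negatives)
    pos_term = np.log(sigmoid(in_vecs[t] @ out_vecs[c]))
    neg_term = np.sum(np.log(sigmoid(-in_vecs[t] @ out_vecs[negatives].T)))
    return -(pos_term + neg_term)

print(neg_sampling_loss("cat", "sat"))
```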
