The doc2vec implementation in the Python gensim library works the following way:
It basically trains word vectors just like word2vec, but additionally trains document vectors at the same time.
That is, if you run just word2vec, every observation is a sample text (= a document), and you learn word vectors for all words that occur in the sample texts (minus the ones you exclude, e.g. common words like "the"). This is done by iterating over all observations one by one, multiple times.
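As a minimal sketch of the plain word2vec case (gensim 4.x API; the toy corpus and parameter values here are made up for illustration):

```python
from gensim.models import Word2Vec

# Each observation is one tokenized sample text.
texts = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "pets"],
]

model = Word2Vec(
    sentences=texts,
    vector_size=50,   # dimensionality of the word vectors
    min_count=1,      # keep even rare words in this toy corpus
    epochs=10,        # number of passes over all observations
)

print(model.wv["cat"])  # learned word vector for "cat"
```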
If you run doc2vec, every observation is again a sample text, and you learn word vectors for all words that occur in the sample texts. But in addition, you learn one vector for the observation (= sample text = document) itself. That is, you still iterate over all observations multiple times, but in every step, when you update the word vectors with the data from one observation, you also update the document vector corresponding to that particular document. Basically, the document itself is treated as a word that occurs only in this document. See Figure 2 in the paper that introduced the doc2vec algorithm, Le & Mikolov (2014), "Distributed Representations of Sentences and Documents".
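The same toy setup with doc2vec looks roughly like this (again gensim 4.x; the tags and parameter values are just illustrative assumptions). Each observation now carries a tag identifying the document, and that tag gets its own vector:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each observation is a tokenized sample text plus a tag naming the document.
docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc_0"]),
    TaggedDocument(words=["dogs", "and", "cats", "are", "pets"], tags=["doc_1"]),
]

model = Doc2Vec(
    documents=docs,
    vector_size=50,   # word vectors and document vectors share this dimensionality
    min_count=1,
    epochs=10,
    dm=1,             # PV-DM mode: the document vector is updated alongside the word vectors
)
```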
In essence, you get word vectors AND document vectors when running doc2vec, but of course it can take longer than just running word2vec. And in theory, the word vectors should be the same between word2vec and doc2vec (or rather, hold the same information; since they are randomly initialized, no two runs will ever produce identical vectors).
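Continuing the sketch above, after training you can read out both kinds of vectors, and infer a vector for an unseen document without retraining (still gensim 4.x attribute names):

```python
print(model.wv["cat"])     # word vector, as in word2vec
print(model.dv["doc_0"])   # document vector for the observation tagged "doc_0"

# Vector for a new, unseen document.
new_vec = model.infer_vector(["a", "new", "text", "about", "cats"])
```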