Pre-processing: lemmatizing and stemming make a better doc2vec?

Question

I have a project in which I will turn the documents of a corpus into doc2vec vectors. I have read that when people convert a document to a bag of words, they try to improve it by removing stop words, lemmatizing, and stemming.

I was going to do the same for my doc2vec preparation, but I read that lemmatizing and stemming are not necessary, so I only removed the stop words. Does anybody have experience with doc2vec, and which pre-processing steps produce the best doc2vec representation?

score 0 · Accepted Answer · answered Jul 23 '19 at 11:28

The authors of doc2vec did not explain in the literature how pre-processing affects a model's evaluation. However, they said that special characters such as ,.!? are treated as normal words.
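To illustrate the point about punctuation, here is a minimal sketch of a tokenizer that keeps each special character as its own token and only optionally drops stop words. The `tokenize` helper, its regular expression, and the sample stop word set are my own assumptions for illustration, not the tokenizer the doc2vec authors used.

```python
import re

# Hypothetical pattern: a run of word characters, or one non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text, stopwords=frozenset()):
    """Lowercase the text, split it into word and punctuation tokens,
    and drop any token found in `stopwords` (empty by default)."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in stopwords]


sentence = "Is stemming needed? No, it is not!"

# Punctuation marks survive as ordinary tokens; no stemming or lemmatizing.
print(tokenize(sentence))

# Same sentence with a tiny, made-up stop word set removed.
print(tokenize(sentence, stopwords={"is", "it"}))
```

Token lists like these can then be wrapped in gensim's `TaggedDocument` objects for training, so it is easy to train once with and once without stop word removal and compare the results on your own task.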

The word vectors in Google's pre-trained word2vec model are not stemmed or lemmatized either.

I believe there is no standard; whether or not you pre-process depends on your goal, for example linguistic analysis.

Check this also: https://groups.google.com/forum/#!topic/gensim/17Knu4Xoe9U
