I have a project in which I will turn the documents of a corpus into doc2vec vectors. I read that when people convert a document to a bag of words, they try to improve it by removing stop words, lemmatizing, and stemming.

I was going to do the same for my doc2vec preparation, but I read that lemmatizing and stemming are not necessary, so I only removed the stop words. Does anybody have experience with doc2vec, and which pre-processing steps produce the best doc2vec representation?

zipline86

1 Answer

The authors of doc2vec did not explain in the literature how pre-processing affects the evaluation of a model. However, they did say that special characters such as , . ! ? are treated as normal words.

The word vectors in Google's pre-trained word2vec model are also not stemmed or lemmatized.

I believe there is no standard; whether you pre-process or not depends on your goal, for example linguistic analysis.
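As a rough illustration of light pre-processing in that spirit, here is a minimal sketch that lowercases the text, keeps punctuation marks as tokens of their own, and does no stemming or lemmatizing. The `tokenize` helper, its regular expression, and the sample documents are my own assumptions for illustration, not something taken from the doc2vec paper; stop-word removal is left optional so you can compare both variants on your own task.

```python
import re

# Hypothetical pattern: a run of word characters is one token,
# and every single punctuation character is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text, lowercase=True, stopwords=None):
    """Split text into word and punctuation tokens, without stemming."""
    if lowercase:
        text = text.lower()
    tokens = TOKEN_PATTERN.findall(text)
    if stopwords:
        # Optional step: drop stop words only when a set is supplied.
        tokens = [t for t in tokens if t not in stopwords]
    return tokens


# Made-up sample documents, just to show the shape of the output.
docs = ["Hello, world! Is this working?", "The cats were running."]
corpus = [tokenize(d) for d in docs]
```

If you train with gensim, each token list in `corpus` could then be wrapped in a `TaggedDocument` with a unique tag before it is passed to `Doc2Vec`.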

Check this also: https://groups.google.com/forum/#!topic/gensim/17Knu4Xoe9U

Yoo Inhyeok