I have about 7,000,000 patents that I would like to compute document similarity over. Obviously a set that big would take a long time to run, so for now I am working with a small sample of about 5,600 patent documents and preparing to use Doc2Vec to find the similarity between them. In many of the examples, and in the Le & Mikolov paper, Doc2Vec is trained on roughly 100,000 documents that are all short reviews. My documents are much longer, 3,000+ words each, but I have far fewer of them.

Should I still use Doc2Vec on this limited sample set? Or should I use something like Word Mover's Distance with Word2Vec, since I have perhaps almost as many total words as in that paper but far fewer documents? Gensim provides access to pre-trained Word2Vec models. I don't understand Doc2Vec/Word2Vec very well yet, but can I use those pre-trained vectors when training Doc2Vec? Anyone have any suggestions?
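For concreteness, here is roughly what I am considering for the two options. This is only a sketch: `patent_ids` and `patent_texts` are placeholders for my own data, and the hyperparameters are guesses, not anything I've validated.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.downloader as api

# placeholders: patent_ids is a list of IDs, patent_texts the matching raw texts
tokenized = [text.lower().split() for text in patent_texts]

# Option 1: train Doc2Vec directly on the ~5,600 patents
tagged = [TaggedDocument(words=tokens, tags=[pid])
          for pid, tokens in zip(patent_ids, tokenized)]
d2v = Doc2Vec(tagged, vector_size=300, window=8, min_count=5,
              epochs=40, workers=4)  # extra epochs since the corpus is small
similar = d2v.dv.most_similar(patent_ids[0], topn=10)  # .docvecs in older gensim

# Option 2: pre-trained Word2Vec + Word Mover's Distance
w2v = api.load('word2vec-google-news-300')              # pre-trained vectors
dist = w2v.wmdistance(tokenized[0], tokenized[1])       # smaller = more similar
```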
Note: I have already implemented LDA/LSI and cosine similarity over TF-IDF vectors. I'm trying to see which method gives the most accurate similarity measure so I can test similarity measures over time.
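For reference, my existing TF-IDF baseline is along these lines (again, `tokenized` is a placeholder for my preprocessed patents):

```python
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
tfidf = models.TfidfModel(bow_corpus)
index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                      num_features=len(dictionary))
sims = index[tfidf[bow_corpus[0]]]  # cosine similarity of patent 0 vs. all others
```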