I'm trying to build a semantic search engine with Doc2Vec: you query the model with a document and it returns the N most similar documents from its training corpus. I'm having trouble pushing accuracy past 60% on a basic sanity check: when I re-infer a vector for a document the model was trained on, the most similar document in the model should be that same query document. In a previous project I had no trouble reaching 97% accuracy on this check, but that corpus had roughly 10x as many tokens spread across 25x as many documents.
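For reference, this is roughly the kind of self-similarity check I'm running. It's a minimal sketch assuming gensim 4.x; `tokenized_docs` and the hyperparameters are placeholders, not my exact setup.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical pre-tokenized corpus: a list of token lists.
tokenized_docs = [["some", "tokens"], ["more", "tokens"]]

# Tag each document with its index so it can be looked up later.
corpus = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenized_docs)]

# Example hyperparameters only.
model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# For each training document, re-infer its vector and check whether the
# model ranks that same document as its own nearest neighbor.
hits = 0
for doc in corpus:
    inferred = model.infer_vector(doc.words)
    top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
    if top_tag == doc.tags[0]:
        hits += 1

print(f"Self-similarity accuracy: {hits / len(corpus):.2%}")
```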
What's a reasonable rule-of-thumb lower limit on corpus size for Doc2Vec, both in tokens per document and in total number of documents?