I'm trying to build a semantic search engine with Doc2Vec: you query the model with a document and it returns the N most similar documents from its training corpus. I'm having trouble pushing accuracy past 60% on a basic sanity check: when I re-infer a vector for a document the model was trained on, the most similar document in the model should be that same query document. In a previous project I had no trouble reaching 97% accuracy on this check, but that corpus had roughly 10x as many tokens spread across 25x as many documents.
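For reference, this is roughly the kind of self-similarity check I'm running. It's a minimal sketch assuming gensim 4.x; `tokenized_docs` and the hyperparameters are placeholders, not my exact setup.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical pre-tokenized corpus: a list of token lists.
tokenized_docs = [["some", "tokens"], ["more", "tokens"]]

# Tag each document with its index so it can be looked up later.
corpus = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenized_docs)]

# Example hyperparameters only.
model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# For each training document, re-infer its vector and check whether the
# model ranks that same document as its own nearest neighbor.
hits = 0
for doc in corpus:
    inferred = model.infer_vector(doc.words)
    top_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
    if top_tag == doc.tags[0]:
        hits += 1

print(f"Self-similarity accuracy: {hits / len(corpus):.2%}")
```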
What's a reasonable rule-of-thumb lower limit on corpus size for Doc2Vec, both in tokens per document and in total number of documents?