I have about 7,000,000 patents that I would like to compute document similarity over. Obviously a set that big would take a long time to run, so for now I am working with a small sample of about 5,600 patent documents and preparing to use Doc2Vec to find the similarity between them. In many of the examples, and in the Le & Mikolov paper, Doc2Vec is trained on roughly 100,000 documents that are all short reviews. My documents are much longer, 3,000+ words each, but I have far fewer of them.

Should I still use Doc2Vec on this limited sample set? Or should I use something like Word Mover's Distance with Word2Vec, since I have perhaps almost as many total words as in that paper but far fewer documents? Gensim provides access to pre-trained Word2Vec models. I don't understand Doc2Vec/Word2Vec very well yet, but can I use those pre-trained vectors when training Doc2Vec? Anyone have any suggestions?
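For concreteness, here is roughly what I am considering for the two options. This is only a sketch: `patent_ids` and `patent_texts` are placeholders for my own data, and the hyperparameters are guesses, not anything I've validated.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import gensim.downloader as api

# placeholders: patent_ids is a list of IDs, patent_texts the matching raw texts
tokenized = [text.lower().split() for text in patent_texts]

# Option 1: train Doc2Vec directly on the ~5,600 patents
tagged = [TaggedDocument(words=tokens, tags=[pid])
          for pid, tokens in zip(patent_ids, tokenized)]
d2v = Doc2Vec(tagged, vector_size=300, window=8, min_count=5,
              epochs=40, workers=4)  # extra epochs since the corpus is small
similar = d2v.dv.most_similar(patent_ids[0], topn=10)  # .docvecs in older gensim

# Option 2: pre-trained Word2Vec + Word Mover's Distance
w2v = api.load('word2vec-google-news-300')              # pre-trained vectors
dist = w2v.wmdistance(tokenized[0], tokenized[1])       # smaller = more similar
```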
Note: I have already implemented LDA/LSI and cosine similarity over TF-IDF vectors. I'm trying to see which method gives the most accurate similarity measure so I can test similarity measures over time.
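For reference, my existing TF-IDF baseline is along these lines (again, `tokenized` is a placeholder for my preprocessed patents):

```python
from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
tfidf = models.TfidfModel(bow_corpus)
index = similarities.MatrixSimilarity(tfidf[bow_corpus],
                                      num_features=len(dictionary))
sims = index[tfidf[bow_corpus[0]]]  # cosine similarity of patent 0 vs. all others
```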