Questions tagged [doc2vec]

Doc2vec (aka paragraph2vec, aka sentence embeddings) extends the word2vec algorithm to the unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs, or entire documents.
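For readers new to the tag, here is a minimal training sketch using gensim's Doc2Vec (the class, parameter names, and infer_vector call are gensim's own API; the toy corpus and integer tags are made up for illustration):

```python
# Minimal sketch: training paragraph vectors with gensim's Doc2Vec.
# The toy corpus and integer tags are invented for illustration.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply on monday",
]

# Each training document is a list of tokens plus at least one tag.
corpus = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(raw_docs)]

model = Doc2Vec(
    corpus,
    vector_size=50,   # dimensionality of the document vectors
    window=2,         # context window for the word-prediction task
    min_count=1,      # keep all words in this tiny example
    epochs=40,        # extra passes help on small corpora
)

# Infer a vector for a new, unseen piece of text.
new_vec = model.infer_vector("a dog sat on the mat".split())
print(new_vec.shape)  # (50,)
```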

19 questions
6 votes, 0 answers

Understanding Object2Vec

AWS released an interesting SageMaker feature called Object2Vec that lets you build an embedding for search out of pretty much anything: documents, users, products, recommendations, time-series data, DNA, etc. The official…
Ryan Zotti • 5,927 • 6 • 29 • 33
5 votes, 1 answer

Why have a tanh layer, a max-pooling layer and then another tanh layer?

I have been reading a Facebook paper (linked here) and am confused about certain features of the architecture. I do not understand why they have a tanh layer, a max-pooling layer, and then another tanh layer. I understand what each layer does, but I…
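A generic sketch of the layer sequence being asked about, written in PyTorch; this is not the paper's exact architecture, and the embedding size, hidden size, and pooling-over-time choice are assumptions for illustration:

```python
# Generic sketch of a tanh -> max-pooling -> tanh stack (not the paper's exact model).
import torch
import torch.nn as nn

class TanhPoolTanh(nn.Module):
    def __init__(self, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.proj1 = nn.Linear(embed_dim, hidden_dim)   # first non-linear projection
        self.proj2 = nn.Linear(hidden_dim, hidden_dim)  # second non-linear projection

    def forward(self, x):                 # x: (batch, seq_len, embed_dim)
        h = torch.tanh(self.proj1(x))     # tanh layer: position-wise non-linearity
        h, _ = h.max(dim=1)               # max pooling over the sequence dimension
        return torch.tanh(self.proj2(h))  # second tanh layer on the pooled vector

x = torch.randn(4, 20, 128)               # batch of 4 sequences, 20 tokens each
print(TanhPoolTanh()(x).shape)             # torch.Size([4, 64])
```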
4 votes, 1 answer

Word2Vec vs. Doc2Vec Word Vectors

I am doing some analysis on document similarity and was also interested in word similarity. I know that doc2vec inherits from word2vec and by default trains word vectors, which we can access. My question is: should we expect these word vectors…
Tylerr • 1,225 • 5 • 16
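For context, both gensim models expose their learned word vectors through the .wv attribute, so the two sets can be compared directly. A small sketch (the toy corpus is made up; note that PV-DBOW mode, dm=0, only trains word vectors when dbow_words=1, while PV-DM, dm=1, trains them alongside the document vectors):

```python
# Sketch: comparing word vectors learned by Word2Vec and by Doc2Vec (gensim API).
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["cats", "and", "dogs"]]
tagged = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]

w2v = Word2Vec(sentences, vector_size=50, min_count=1, epochs=40)
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40, dm=1)  # PV-DM trains word vectors

# Both models expose their word vectors through the .wv attribute.
print(w2v.wv.most_similar("cat", topn=2))
print(d2v.wv.most_similar("cat", topn=2))
```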
4 votes, 1 answer

How to train sentence/paragraph/document embeddings?

I'm well aware of word embeddings (word2vec or GloVe) and I know of four papers treating the subject of more general embeddings: Distributed Representations of Sentences and Documents - Quoc V. Le, Tomas Mikolov …
4 votes, 2 answers

Doc2Vec for large documents

I have about 7,000,000 patents for which I would like to find the document similarity. Obviously, with a sample set that big it will take a long time to run. I am just taking a small sample of about 5,600 patent documents and I am preparing to use…
www3 • 601 • 8 • 16
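For corpora of this size, one common pattern is to stream documents from disk rather than hold them in memory, since gensim's Doc2Vec accepts any restartable iterable of TaggedDocument. A sketch under that assumption (the directory layout, file naming, and whitespace tokenisation are hypothetical):

```python
# Sketch: streaming a large document collection into Doc2Vec without loading it all at once.
import os
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

class PatentCorpus:
    """Yields one TaggedDocument per text file in a directory, on every pass."""
    def __init__(self, directory):
        self.directory = directory

    def __iter__(self):
        for name in sorted(os.listdir(self.directory)):
            with open(os.path.join(self.directory, name), encoding="utf-8") as fh:
                tokens = fh.read().lower().split()   # naive whitespace tokenisation
            yield TaggedDocument(words=tokens, tags=[name])

corpus = PatentCorpus("patents/")   # hypothetical directory of .txt files, one per patent
model = Doc2Vec(corpus, vector_size=100, min_count=5, workers=4, epochs=10)
model.save("patents.d2v")
```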
2 votes, 1 answer

Generating Sentence Vectors from Word2Vec

I know that I can use doc2vec and other resources to get sentence vectors. But I am very curious to generate sentence vectors using Word2Vec. I read a lot of material and found that averaging the embeddings is the baseline approach, but it is not…
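The averaging baseline mentioned above fits in a few lines. This sketch assumes an already-trained gensim Word2Vec model named w2v (an assumption, not shown here) and simply skips out-of-vocabulary words:

```python
# Baseline sketch: a sentence vector as the average of its word2vec word vectors.
import numpy as np

def average_sentence_vector(tokens, w2v):
    # Collect vectors only for words the model knows; skip the rest.
    vectors = [w2v.wv[t] for t in tokens if t in w2v.wv]
    if not vectors:                                    # no known words at all
        return np.zeros(w2v.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

# Usage (assuming a trained model `w2v`):
# sent_vec = average_sentence_vector("the cat sat on the mat".split(), w2v)
```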
2 votes, 1 answer

Pre-processing: do lemmatizing and stemming make a better doc2vec?

I have a project in which I will turn the documents of a corpus into doc2vec vectors. I was reading that when people convert a document to a bag of words, they try to improve the bag of words by removing stopwords, lemmatizing, and stemming. I was going to do this…
zipline86 • 235 • 2 • 11
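For reference, the preprocessing steps mentioned (stopword removal, lemmatizing, stemming) typically look like the NLTK sketch below; whether they actually help doc2vec is exactly the open question:

```python
# Sketch of a typical bag-of-words-style preprocessing pipeline with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(text):
    tokens = text.lower().split()                        # naive whitespace tokenisation
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]   # e.g. "documents" -> "document"
    tokens = [stemmer.stem(t) for t in tokens]           # e.g. "running" -> "run"
    return tokens

print(preprocess("The cats were running across the documents"))
```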
2 votes, 1 answer

How to improve a doc2vec model

I would like to do some sentence embedding on around 500 sentences. The purpose is to find, for new sentences, the most similar ones within the 500 sentences. Unfortunately, for now it's definitely not working. Indeed, to test my model I simply looked…
miki • 212 • 2 • 10
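A sketch of the similarity test being described, using gensim (version 4 or later, where document vectors live under model.dv; older versions use model.docvecs); the three placeholder sentences stand in for the 500:

```python
# Sketch: querying a trained Doc2Vec model for the most similar training sentences.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["the cat sat on the mat", "dogs bark loudly", "the kitten slept on the rug"]
tagged = [TaggedDocument(s.split(), [i]) for i, s in enumerate(sentences)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

query = "a cat was on a mat".split()
query_vec = model.infer_vector(query, epochs=100)   # more inference epochs -> steadier vectors
for tag, score in model.dv.most_similar([query_vec], topn=2):
    print(score, sentences[tag])
```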
1 vote, 0 answers

Using doc2vec embeddings as model input, or perhaps for similarity comparison?

Doc2vec is an extension of word2vec which creates vector representations of documents. One can use these representations as input to some classifier/regressor (Logistic Regression, XGBoost, LightGBM ...). What about using the similarity as a…
Borut Flis • 221 • 1 • 8
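A sketch of the first option, feeding document vectors into a downstream classifier; scikit-learn's LogisticRegression is used here, and the vectors and labels are random placeholders standing in for real Doc2Vec output and real classes:

```python
# Sketch: Doc2Vec document vectors as features for a downstream classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# X: one learned/inferred document vector per document, y: the label to predict.
X = np.random.randn(200, 50)           # stand-in for model.dv.vectors or infer_vector output
y = np.random.randint(0, 2, size=200)  # stand-in for real class labels

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5).mean())   # cross-validated accuracy
```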
1 vote, 0 answers

NLP for customer reviews and summaries

I'm trying to develop a model in R that will compare a customer review with a summary of that review written by an employee. The purpose is to ensure that the employee is accurately tagging and summarizing the customer review. In more…
pr478 • 11 • 1
1 vote, 1 answer

How should I formalize the Doc2Vec matrix dimensions?

Below, I have a simple diagram explaining the matrix dimensions of word2vec. My goal is to expand this diagram to incorporate document vectors for doc2vec. However, I'm having trouble understanding the original paper, specifically how to…
alpaca • 163 • 1 • 6
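A hedged sketch of the dimensions in the PV-DM variant of the Le & Mikolov paper, in my own notation (V = vocabulary size, N = number of documents, d = embedding dimension, k = number of context words), not the asker's diagram:

```latex
% My notation, not the paper's figure:
% V = vocabulary size, N = number of documents, d = embedding dim, k = context words.
\begin{aligned}
W &\in \mathbb{R}^{V \times d} && \text{word embedding matrix, as in word2vec} \\
D &\in \mathbb{R}^{N \times d} && \text{document (paragraph) matrix, the addition in doc2vec} \\
h &= \tfrac{1}{k+1}\Bigl( D_i + \textstyle\sum_{j=1}^{k} W_{c_j} \Bigr) \in \mathbb{R}^{d}
  && \text{averaged context; concatenation instead gives } h \in \mathbb{R}^{(k+1)d} \\
\hat{y} &= \operatorname{softmax}(U h + b), \quad U \in \mathbb{R}^{V \times d}
  && \text{softmax weights (} U \in \mathbb{R}^{V \times (k+1)d} \text{ if concatenating)}
\end{aligned}
```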
0 votes, 1 answer

Normalizing Topic Vectors in Top2vec

I am trying to understand how Top2Vec works. I have some questions about the code that I could not find an answer to in the paper. A summary of what the algorithm does is that it embeds word and document vectors in the same semantic space and normalizes…
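On the normalization step: length-normalizing vectors makes cosine similarity equivalent to a plain dot product, which is the usual motivation for this kind of step. A plain numpy sketch (not Top2Vec's own code):

```python
# Sketch: L2-normalizing a matrix of vectors so cosine similarity reduces to a dot product.
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)   # guard against zero-length vectors

vecs = np.random.randn(5, 300)
unit = l2_normalize(vecs)
print(np.allclose(np.linalg.norm(unit, axis=1), 1.0))  # True: all rows have unit length
```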
0 votes, 0 answers

Document subsimilarity matching

I'm looking to classify subsections of "full" documents based on their similarity to a set of subsections that have been manually curated and assigned labels (let's call these short documents). There are about 50 categories with 5-10 short documents…
COM • 101 • 2
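One simple baseline for this setup is a nearest-neighbour vote over the labeled short documents, using cosine similarity on whatever embeddings are chosen (Doc2Vec or otherwise); the vectors, label counts, and k below are placeholders:

```python
# Sketch: label a subsection by a majority vote over its most similar labeled short documents.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

short_doc_vecs = np.random.randn(300, 50)              # ~50 categories x 5-10 short docs each
short_doc_labels = np.random.randint(0, 50, size=300)  # category of each short document

def classify_subsection(subsection_vec, k=5):
    sims = cosine_similarity(subsection_vec.reshape(1, -1), short_doc_vecs)[0]
    top = np.argsort(sims)[::-1][:k]                   # k most similar short documents
    labels, counts = np.unique(short_doc_labels[top], return_counts=True)
    return labels[np.argmax(counts)]                   # majority vote among neighbours

print(classify_subsection(np.random.randn(50)))
```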
0 votes, 1 answer

Doc2vec Corpus Size Recommendation

I'm trying to make a semantic search engine with Doc2Vec, where you query the model with a document and it returns the N most similar documents from its training corpus. I'm having trouble pushing accuracy past 60% when the model is given a document it's…
fpt • 123 • 3
0 votes, 0 answers

What to make of high R-squared and non-significant p-value of a linear model?

I am using doc2vec to produce $\mathbb{R}^{50}$ vector representations of short bits of text. I am then using those vectors in a linear model to predict a continuous outcome variable. The $R^2$ is 0.25, which I believe is good considering what I am…
Ashish • 296 • 1 • 3 • 12
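For reference, statsmodels reports the R² and the overall F-test p-value side by side, which is the pair being compared above; the data here is random noise purely to show where the numbers come from:

```python
# Sketch: OLS on 50-dimensional doc2vec features, reporting R^2 and the F-test p-value.
import numpy as np
import statsmodels.api as sm

X = np.random.randn(120, 50)   # stand-in for the doc2vec vectors
y = np.random.randn(120)       # stand-in for the continuous outcome

ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.rsquared)            # R^2 of the fit
print(ols.f_pvalue)            # p-value of the overall F-test
# With 50 predictors and relatively few observations, a sizeable R^2 can coexist
# with a non-significant overall F-test.
```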