
I have a collection of words from different communities. Each community has a different way of using language and will provide a different word embedding. I can concatenate the sentences from the different communities to produce one corpus, but I fear I will lose nuance between how language is used in the different communities.

Do you have any recommendations on what level I should run the word embedding on? Should I simply run it on all sentences, risking over-generalization? Or is there a way to factor in the differences between documents?

I'm relatively new to this. Any feedback will be helpful.

VminVsky

1 Answer


Training word embeddings at the sentence level versus the document level typically does not make a huge difference. During training, both the skip-gram and CBOW algorithms make predictions only within a small sliding window of text (typically 5 words), so most of the predictions will fall within sentence boundaries anyway. Also, you cannot really hope to learn document-level features, because discourse phenomena (such as coreference, stylistic cohesion, etc.) occur over much longer distances than the small sliding window covers.
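To make the sliding-window point concrete, here is a minimal sketch (plain Python, not a real word2vec implementation) of how skip-gram training pairs are extracted. The window size of 5 and the toy corpus are illustrative assumptions; the takeaway is that no (center, context) pair ever crosses a sentence boundary, so concatenating sentences into documents barely changes the training signal.

```python
def skipgram_pairs(sentences, window=5):
    """Collect (center, context) training pairs within each sentence.

    Mirrors how skip-gram gathers context: for each center word,
    only words at most `window` positions away (in the same
    sentence) are used as prediction targets.
    """
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs

# Toy two-sentence "corpus" (hypothetical example data).
corpus = [
    ["word", "embeddings", "capture", "local", "context"],
    ["discourse", "phenomena", "span", "longer", "distances"],
]

pairs = skipgram_pairs(corpus, window=5)
# Pairs stay within sentences: "context" and "discourse" are
# adjacent in the concatenated corpus but never form a pair here.
```

If you do want per-community nuance, one common approach is simply to train a separate embedding model on each community's sentences (the pair-extraction above would then run per community) and compare the resulting vector spaces.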

Jindřich