
I have a collection of words from different communities. Each community has a different way of using language and will provide a different word embedding. I can concatenate the sentences from the different communities to produce one corpus, but I fear I will lose nuance between how language is used in the different communities.

Do you have any recommendations on what level I should run the word embedding on? Should I simply run it on all sentences, risking over-generalization? Or is there a way to factor in the differences between documents?

I'm relatively new to this. Any feedback will be helpful.

VminVsky

1 Answer


Training word embeddings at the sentence level versus the document level typically does not make a huge difference. During training, both the skip-gram and CBOW algorithms make predictions only within a small sliding window of text (typically 5 words), so most of the predictions will fall within sentence boundaries anyway. Also, you cannot really hope to learn document-level features, because discourse phenomena (such as coreference, stylistic cohesion, etc.) occur over much longer distances than the small sliding window covers.
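To make the sliding-window point concrete, here is a minimal sketch (plain Python, not a real word2vec implementation) of how skip-gram training pairs are extracted. The window size of 5 and the toy corpus are illustrative assumptions; the takeaway is that no (center, context) pair ever crosses a sentence boundary, so concatenating sentences into documents barely changes the training signal.

```python
def skipgram_pairs(sentences, window=5):
    """Collect (center, context) training pairs within each sentence.

    Mirrors how skip-gram gathers context: for each center word,
    only words at most `window` positions away (in the same
    sentence) are used as prediction targets.
    """
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs

# Toy two-sentence "corpus" (hypothetical example data).
corpus = [
    ["word", "embeddings", "capture", "local", "context"],
    ["discourse", "phenomena", "span", "longer", "distances"],
]

pairs = skipgram_pairs(corpus, window=5)
# Pairs stay within sentences: "context" and "discourse" are
# adjacent in the concatenated corpus but never form a pair here.
```

If you do want per-community nuance, one common approach is simply to train a separate embedding model on each community's sentences (the pair-extraction above would then run per community) and compare the resulting vector spaces.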

Jindřich