
I'm trying to understand Latent Dirichlet Allocation (LDA) so that I can apply it to a Twitter dataset. I have a dataset of 10k tweets, which I've already split into six groups. Now I'd like to extract topics from each group separately, but I don't fully understand the concept of a "document" in LDA. Can I use each group as a document (so 6 documents), or must I split each group into a predefined number of documents (e.g. take group 1 and divide its tweets based on their hashtags)?

Thanks

Daniele
  • You haven't said what you'd like to use the LDA topics to *do*. (Documents in LDA are basically any set of words, so the choice of what constitutes a document more follows from the application.) – Sean Easter May 25 '16 at 13:39
  • Yes, I'd like to understand what the (main) topics of each group are, in order to characterise these groups. Maybe in this case I should treat each group as an "LDA document", so that a single document is the set of words from every tweet in a specific group. – Daniele May 25 '16 at 14:35
  • "Like to understand" is a little vague as an analytical question :) Try taking a look at [this answer](http://stats.stackexchange.com/a/210371/28462) and revising your question to include specifics. If you're just looking to use the topics as a description, it sounds like you might prefer to run LDA on each group separately, treat each tweet as a document, and then compare the resulting topics. (You could think of this very loosely as analogous to comparing the mean values between groups.) – Sean Easter May 25 '16 at 19:55
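
To make the two readings of "document" discussed above concrete, here is a minimal sketch in plain Python. It only builds the bag-of-words input that an LDA implementation would consume; it does not fit a model. The `groups` dictionary, the toy tweets, and the helper names `tokenize`, `tweets_as_documents`, and `group_as_document` are all hypothetical and not part of the original dataset.

```python
from collections import Counter

# Hypothetical toy data: group label -> list of tweets (assumed for illustration).
groups = {
    "group_1": ["lda finds topics in text", "topics are word distributions"],
    "group_2": ["football match tonight", "great match great goal"],
}

def tokenize(text):
    """Lowercase whitespace tokenizer; real tweets would need more cleaning."""
    return text.lower().split()

def tweets_as_documents(tweets):
    """One bag of words per tweet: the corpus has len(tweets) documents."""
    return [Counter(tokenize(t)) for t in tweets]

def group_as_document(tweets):
    """One merged bag of words for the group: the corpus has a single document."""
    merged = Counter()
    for t in tweets:
        merged.update(tokenize(t))
    return [merged]

# Build both corpus variants for every group.
per_tweet = {g: tweets_as_documents(ts) for g, ts in groups.items()}
per_group = {g: group_as_document(ts) for g, ts in groups.items()}

for g in groups:
    print(g, "->", len(per_tweet[g]), "tweet documents,",
          len(per_group[g]), "merged document")
```

With the per-tweet variant, each group yields a corpus of many short documents, which matches the suggestion to run LDA on each group with tweets as documents. With the merged variant, each group is a single document, so fitting LDA separately per group would see only one document and have little variation across documents to learn from; merged groups make more sense as six documents in one shared corpus.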

0 Answers