Using NLTK and Pandas, I was able to process some text files and generate word count data for them, and finally create a histogram describing word frequency.
However, I'm wondering what kind of analysis should I do in order to characterize this distribution. I'm not sure how should I proceed in order to characterize it. I know for a fact that it wouldn't be possible to fit it into a Poisson distribution as the mean is different from the variance.
Any pointers on how can I find out which type of distribution could this data be fit into? Since we're talking about discrete data and looking at the histogram, my initial guesses were Poisson or Negative Binomial. However, mean is different from variance so that would leave me with negative binomial, or binomial.
I tend to think that Negative Binomial is more likely, however I still have to figure out a way to test this assumption.