Characterizing/Fitting Word Count Data into Zipf / Power Law / LogNormal

Question

Using NLTK and Pandas, I was able to process some text files and generate word count data for them, and finally create a histogram describing word frequency.

However, I'm not sure what kind of analysis I should do to characterize this distribution. I know it cannot be fit with a Poisson distribution, because the mean differs from the variance.

Any pointers on how I can find out which type of distribution this data could be fit to? Since the data is discrete, and judging by the histogram, my initial guesses were Poisson or negative binomial. However, the mean differs from the variance, which rules out Poisson and leaves me with negative binomial or binomial.

I tend to think that negative binomial is more likely, but I still have to figure out a way to test this assumption.
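
The mean-versus-variance comparison mentioned above can be sketched as follows. This is only an illustration: `demo_word_counts` is made-up data standing in for the real per-word counts, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for the per-word counts (one entry per distinct word)
demo_word_counts = np.array([1, 1, 1, 2, 2, 3, 5, 8, 40, 120])

sample_mean = demo_word_counts.mean()
sample_var = demo_word_counts.var(ddof=1)  # unbiased sample variance

# A Poisson distribution has variance equal to its mean, so this ratio
# should be close to 1 for Poisson-like data; values far above 1
# indicate overdispersion.
dispersion_index = sample_var / sample_mean
print(sample_mean, sample_var, dispersion_index)
```

A ratio far above 1 argues against Poisson, but it does not by itself single out negative binomial over other heavy-tailed alternatives.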

The distribution that maximizes the entropy given a bunch of sufficient statistics (like you have) is an exponential family with those sufficient statistics. For example here: https://web.stanford.edu/class/stats311/Lectures/lec-07.pdf — www3, Mar 01 '18 at 20:08
Thank you @www3. I was looking for a pythonic way to test if this data fits into a given distribution. I could inspect my plot against the plots of the distributions from the exponential family, but I wonder if there is another way to do this using scipy — born to hula, Mar 02 '18 at 17:45
What is the minimum of your data? Did you truncate the distribution by ignoring anything with a count less than a certain value? To me, this looks like a Poisson or negative binomial distribution, truncated at 5 — Mark White, Mar 02 '18 at 18:11
You're correct, @Mark White. I removed words with fewer than five occurrences, and also removed some stopwords such as "the", "of", etc. It also looks like Poisson or negative binomial to me; however, the mean differs from the variance. This would leave me with negative binomial. I just wanted to know if there is some way to calculate the goodness of fit for it, or any other possible method to accept/reject the null hypothesis. — born to hula, Mar 02 '18 at 19:09
check out https://cran.r-project.org/web/packages/aster/vignettes/trunc.pdf. I would look for some type of Python implementation for truncated negative-binomial distributions — Mark White, Mar 02 '18 at 21:54
@MarkWhite changed the scope and the question a bit. I'm no longer truncating the distribution - decided to use the raw data. Note that "the" is clearly the most frequent word. Also started a bounty for this btw. — born to hula, Mar 04 '18 at 19:01

David Dale · Answer 1 · 2018-03-05T11:10:54.713

The distribution of word frequencies is often characterized by Zipf's law, which states that the frequencies follow a power law, $p(k) \sim k^{-s}$ (the discrete counterpart of the Pareto distribution).

On a log-log plot of word counts, this power law shows up as a straight line:

import nltk.corpus
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
# may need nltk.download() to use Brown corpus
counter_of_words = Counter(nltk.corpus.brown.words())
counter_of_counts = Counter(counter_of_words.values())
word_counts = np.array(list(counter_of_counts.keys()))
freq_of_word_counts = np.array(list(counter_of_counts.values()))
plt.scatter(np.log(word_counts), np.log(freq_of_word_counts))
plt.xlabel('Log of word frequency')
plt.ylabel('Log of number of such words')
plt.title('Power law for word frequencies')
plt.show()

The negated slope of this line (roughly 0.5) corresponds to the parameter $s$ of the Zipf law. You can estimate this value by maximizing the likelihood:

def neg_zipf_likelihood(s):
    n = sum(freq_of_word_counts)
    # for each word count, find the probability that a random word has such word count
    probas = word_counts ** (-s) / np.sum(np.arange(1, n+1) **(-s))
    log_likelihood = sum(np.log(probas) * word_counts)
    return -log_likelihood

from scipy.optimize import minimize_scalar
s_best = minimize_scalar(neg_zipf_likelihood, [0.1, 3.0] )
print(s_best.x)

which gives you the value of 0.5366.
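
As a rough cross-check on any estimate of $s$, the slope can also be read off the log-log plot with an ordinary least-squares line. The sketch below is not part of the original answer: `loglog_slope` is a hypothetical helper, and the data is a synthetic, exact power law with exponent 1.6 rather than the Brown corpus counts.

```python
import numpy as np

def loglog_slope(values, frequencies):
    """Least-squares slope of log(frequencies) against log(values)."""
    x = np.log(np.asarray(values, dtype=float))
    y = np.log(np.asarray(frequencies, dtype=float))
    slope, _intercept = np.polyfit(x, y, deg=1)
    return slope

# Hypothetical stand-in data: frequencies follow k**(-1.6) exactly
demo_counts = np.arange(1, 101, dtype=float)
demo_freqs = 1e6 * demo_counts ** -1.6

demo_slope = loglog_slope(demo_counts, demo_freqs)
print(-demo_slope)  # negated slope, comparable to the exponent s
```

On real data a least-squares line is sensitive to the sparse, noisy tail of the plot, so it is better treated as a sanity check than as a replacement for the likelihood-based estimate.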

If you are still not sure whether Zipf's distribution or some other distribution is appropriate, you can compare the log-likelihood of your data under different distributions, or choose one using the Kolmogorov-Smirnov test.
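
A minimal sketch of the log-likelihood comparison, under stated assumptions: the data is synthetic (drawn with NumPy's `zipf` sampler, exponent 2.5), the geometric distribution is chosen only as a simple one-parameter alternative, and all variable names are hypothetical. The Kolmogorov-Smirnov route is not shown.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
# Hypothetical stand-in for per-word counts (positive integers)
demo_data = rng.zipf(2.5, size=5000)

def zipf_neg_loglik(a):
    # Negative log-likelihood under scipy's zipf (zeta) distribution
    return -np.sum(stats.zipf.logpmf(demo_data, a))

# Maximum-likelihood estimate of the Zipf exponent (must be > 1 here)
zipf_fit = minimize_scalar(zipf_neg_loglik, bounds=(1.01, 6.0), method="bounded")
a_hat = zipf_fit.x
loglik_zipf = -zipf_fit.fun

# Geometric distribution on 1, 2, ...: the MLE of p is 1 / sample mean
p_hat = 1.0 / demo_data.mean()
loglik_geom = np.sum(stats.geom.logpmf(demo_data, p_hat))

# Both models have one free parameter, so AIC = 2*1 - 2*loglik
aic_zipf = 2 * 1 - 2 * loglik_zipf
aic_geom = 2 * 1 - 2 * loglik_geom
print(a_hat, aic_zipf, aic_geom)
```

The model with the lower AIC (equivalently, the higher log-likelihood when parameter counts match) is the better-supported of the two. With real word counts, the same comparison can be repeated for whichever candidate distributions are of interest.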

Thanks for your answer - this looks promising! Could you please explain or point me to some material on how to interpret the result you obtained above in regards to calculating the neg_zipf_likelihood? — born to hula, Mar 05 '18 at 14:50
@born_to_hula, if you mean the value 0.5366, it is just the parameter of Zipf distribution, just like mean and variance for Normal distribution, or mean (lambda) for Poisson, or p and r for Negative binomial. To understand how I obtained it, you can read the Wikipedia articles on Zipf law and on MLE. — David Dale, Mar 05 '18 at 14:52
The slope in your data looks to be 1.66 (10/6). I think that your fitting is not working well — ivangtorre, Feb 26 '19 at 19:23

score 0 · Answer 2 · answered Feb 26 '19 at 20:50

I used David Dale's example to implement a maximum-likelihood fit of a discrete power law.

import nltk.corpus
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from scipy.optimize import minimize
# may need nltk.download() to use Brown corpus
counter_of_words = Counter(nltk.corpus.brown.words())
counter_of_counts = Counter(counter_of_words.values())

# We sort data
counter_of_counts = sorted(counter_of_counts.items(), key=lambda pair: pair[1], reverse=True)
word_counts = np.asarray(counter_of_counts)[:,0]
freq_of_word_counts = np.asarray(counter_of_counts)[:,1]


f,ax = plt.subplots()
ax.scatter(word_counts, freq_of_word_counts, label = "data")
ax.set_xlabel('Word frequency')
ax.set_ylabel('Number of such words')
ax.set_xscale("log")
ax.set_yscale("log")



def loglik(b):
    # Power law: unnormalized probability of each word count
    probabilities = word_counts**(-b)

    # Normalize over the observed word counts
    probabilities = probabilities / probabilities.sum()

    # Log-likelihood: log-probability of each word count, weighted by
    # the number of words that have that count
    log_likelihood = (np.log(probabilities) * freq_of_word_counts).sum()

    # Maximizing the log-likelihood is the same as minimizing its negative
    return -log_likelihood

s_best = minimize(loglik, [2])
print(s_best)
ax.plot(word_counts[0:2*10**2], 4*10**4*word_counts[0:2*10**2]**-s_best.x, '--', color="orange", lw=3, label = "fitted MLE")
ax.legend()

The result gives a slope of 1.62, which visually fits the data very well.
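
One way to gain confidence in an estimator like this is to run it on synthetic data with a known exponent. The sketch below is an illustration, not part of the original answer: `fit_power_law` is a hypothetical helper that uses the same likelihood as above (normalized over the observed counts only), and the samples come from NumPy's `zipf` sampler with exponent 2.5.

```python
import numpy as np
from collections import Counter
from scipy.optimize import minimize_scalar

def fit_power_law(values, frequencies):
    """MLE of b for p(k) proportional to k**(-b), normalized over `values`."""
    log_values = np.log(np.asarray(values, dtype=float))
    frequencies = np.asarray(frequencies, dtype=float)

    def neg_loglik(b):
        log_weights = -b * log_values
        log_norm = np.log(np.sum(np.exp(log_weights)))
        # Each distinct value contributes its log-probability,
        # weighted by how many times it was observed
        return -np.sum(frequencies * (log_weights - log_norm))

    result = minimize_scalar(neg_loglik, bounds=(0.1, 6.0), method="bounded")
    return result.x

rng = np.random.default_rng(0)
true_exponent = 2.5  # assumed value for the synthetic test
samples = rng.zipf(true_exponent, size=20000)

# Same "value -> number of occurrences" layout as counter_of_counts
value_counts = Counter(samples.tolist())
demo_values = list(value_counts.keys())
demo_freqs = list(value_counts.values())

estimated_exponent = fit_power_law(demo_values, demo_freqs)
print(true_exponent, estimated_exponent)
```

If the estimate lands close to the exponent used to generate the samples, the likelihood and the optimizer call are at least wired up consistently.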
