I have several query frequencies, and I need to estimate the coefficient of Zipf's law. These are the top frequencies:
26486
12053
5052
3033
2536
2391
1444
1220
1152
1039
There are several issues before us in any estimation problem:
1. Estimate the parameter.
2. Assess the quality of that estimate.
3. Explore the data.
4. Evaluate the fit.
For those who would use statistical methods for understanding and communication, the first should never be done without the others.
For estimation it is convenient to use maximum likelihood (ML). The frequencies are so large we can expect the well-known asymptotic properties to hold. ML uses the assumed probability distribution of the data. Zipf's Law supposes the probabilities for $i=1,2,\ldots,n$ are proportional to $i^{-s}$ for some constant power $s$ (usually $s > 0$). Because these probabilities must sum to unity, the constant of proportionality is the reciprocal of the sum
$$H_s(n)=\frac{1}{1^s} + \frac{1}{2^s} + \cdots + \frac{1}{n^s}.$$
Consequently, the logarithm of the probability for any outcome $i$ between $1$ and $n$ is
$$\log(\Pr(i)) = \log\left(\frac{i^{-s}}{H_s(n)}\right) = -s\log(i) - \log(H_s(n)).$$
For independent data summarized by their frequencies $f_i, i=1,2,\ldots, n$, the probability is the product of the individual probabilities,
$$\Pr(f_1,f_2,\ldots,f_n) = \Pr(1)^{f_1}\Pr(2)^{f_2}\cdots\Pr(n)^{f_n}.$$
Thus the log probability for the data is
$$\Lambda(s) = -s \sum_{i=1}^n{f_i \log(i)} - \left(\sum_{i=1}^n{f_i}\right) \log\left(H_s(n)\right).$$
Considering the data as fixed, and expressing this explicitly as a function of $s$, makes it the log Likelihood.
Numerical maximization of the log likelihood with the data given in the question yields $\hat{s} = 1.45041$ and $\Lambda(\hat{s}) = -94046.7$. This is significantly better (but just barely so) than the least squares solution (based on log frequencies) of $\hat{s}_{ls} = 1.463946$, for which $\Lambda(\hat{s}_{ls}) = -94049.5$. (The optimization can be done with a minor change to the elegant, clear R code provided by mpiktas below.)
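For concreteness, a minimal R sketch of that maximization (it parallels mpiktas' code below, with the least-squares objective replaced by the log likelihood) looks like this:
fr <- c(26486, 12053, 5052, 3033, 2536, 2391, 1444, 1220, 1152, 1039)
n <- length(fr)
Lambda <- function(s) -s * sum(fr * log(1:n)) - sum(fr) * log(sum((1:n)^(-s)))  # log likelihood from above
opt <- optimize(Lambda, c(0.5, 10), maximum = TRUE)
opt$maximum    # s-hat, approximately 1.45041
opt$objective  # Lambda(s-hat), approximately -94046.7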
ML will also estimate confidence limits for $s$ in the usual ways. The chi-square approximation gives $[1.43922, 1.46162]$ (if I did the calculations correctly :-).
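Continuing that sketch, the chi-square interval can be found by profiling: the 95% limits are the values of $s$ at which $\Lambda(s)$ falls $\chi^2_{0.95,1}/2$ below its maximum (the bracketing intervals below are just convenient choices):
cutoff <- opt$objective - qchisq(0.95, 1) / 2
lower <- uniroot(function(s) Lambda(s) - cutoff, c(1.3, opt$maximum))$root
upper <- uniroot(function(s) Lambda(s) - cutoff, c(opt$maximum, 1.6))$root
c(lower, upper)  # approximately [1.439, 1.462]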
Given the nature of Zipf's law, the right way to graph this fit is on a log-log plot, where the fit will be linear (by definition):
To evaluate the goodness of fit and explore the data, look at the residuals (data/fit, log-log axes again):
This is not too great: although there is no evident serial correlation or heteroscedasticity in the residuals, they typically deviate from 1.0 by around 10%. With frequencies in the thousands, we wouldn't expect deviations of more than a few percent. The goodness of fit is readily tested with chi square: we obtain $\chi^2 = 656.476$ with $10 - 1 = 9$ degrees of freedom, which is highly significant evidence of departures from Zipf's Law.
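One way to carry out such a test, continuing the sketch above, is with the Pearson statistic (the exact value may differ slightly depending on the variant of the statistic used):
p.hat <- (1:n)^(-opt$maximum) / sum((1:n)^(-opt$maximum))  # fitted Zipf probabilities
expected <- sum(fr) * p.hat                                # expected frequencies
chisq <- sum((fr - expected)^2 / expected)                 # Pearson chi-square statistic
pchisq(chisq, df = n - 1, lower.tail = FALSE)              # p-value: vanishingly small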
Because the residuals appear random, in some applications we might be content to accept Zipf's Law (and our estimate of the parameter) as an acceptable albeit rough description of the frequencies. This analysis shows, though, that it would be a mistake to suppose this estimate has any explanatory or predictive value for the dataset examined here.
Update: I've updated the code with the maximum likelihood estimator, as per @whuber's suggestion. Minimizing the sum of squared differences between the log theoretical probabilities and the log frequencies does give an answer, but it would only be a proper statistical procedure if it could be shown to be some kind of M-estimator; unfortunately, I could not think of one that would give the same results.
Here is my attempt. I calculate the logarithms of the (relative) frequencies and try to fit them to the logarithms of the theoretical probabilities given by the formula above. The final result seems reasonable. Here is my code in R:
fr <- c(26486, 12053, 5052, 3033, 2536, 2391, 1444, 1220, 1152, 1039)
p <- fr/sum(fr)
lzipf <- function(s,N) -s*log(1:N)-log(sum(1/(1:N)^s))
opt.f <- function(s) sum((log(p)-lzipf(s,length(p)))^2)
opt <- optimize(opt.f,c(0.5,10))
> opt
$minimum
[1] 1.463946
$objective
[1] 0.1346248
The best least-squares fit is then $\hat{s} \approx 1.46$.
Maximum likelihood estimation in R can be performed with the mle function (from the stats4 package), which helpfully calculates standard errors (if the correct negative log-likelihood function is supplied):
library(stats4)

# negative log likelihood for Zipf's law over ranks 1..10
ll <- function(s) sum(fr*(s*log(1:10)+log(sum(1/(1:10)^s))))
fit <- mle(ll, start=list(s=1))
> summary(fit)
Maximum likelihood estimation

Call:
mle(minuslogl = ll, start = list(s = 1))

Coefficients:
  Estimate  Std. Error
s 1.451385 0.005715046

-2 log L: 188093.4
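For a rough interval to go with the standard error, one can use a Wald approximation or let stats4 profile the likelihood (a quick sketch):
coef(fit)["s"] + c(-1, 1) * 1.96 * sqrt(vcov(fit)["s", "s"])  # Wald interval, roughly [1.440, 1.463]
confint(fit)                                                  # profile-likelihood interval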
Here is the graph of the fit in log-log scale (again as @whuber suggested):
s.sq <- opt$minimum
s.ll <- coef(fit)
plot(1:10,p,log="xy")
lines(1:10,exp(lzipf(s.sq,10)),col=2)
lines(1:10,exp(lzipf(s.ll,10)),col=3)
The red line is the sum-of-squares fit; the green line is the maximum-likelihood fit.
The maximum likelihood estimates are only point estimates of the parameter $s$. Extra effort is needed to obtain a confidence interval, and even then the interval is not probabilistic: one cannot say "the parameter value $s=\ldots$ lies with probability 95% in the range $[\ldots]$".
A probabilistic programming language such as PyMC3 makes this estimation relatively straightforward; Stan is another option, with great features and a supportive community.
Here is my Python implementation of the model fitted on the OP's data (also on GitHub):
import theano.tensor as tt
import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
data = np.array( [26486, 12053, 5052, 3033, 2536, 2391, 1444, 1220, 1152, 1039] )
N = len( data )
print( "Number of data points: %d" % N )
def build_model():
    with pm.Model() as model:
        # unsure about the prior...
        #s = pm.Normal( 's', mu=0.0, sd=100 )
        #s = pm.HalfNormal( 's', sd=10 )
        s = pm.Gamma('s', alpha=1, beta=10)

        def logp( f ):
            r = tt.arange( 1, N+1 )
            return -s * tt.sum( f * tt.log(r) ) - tt.sum( f ) * tt.log( tt.sum(tt.power(1.0/r,s)) )

        pm.DensityDist( 'obs', logp=logp, observed={'f': data} )

    return model

def run( n_samples=10000 ):
    model = build_model()

    with model:
        start = pm.find_MAP()
        step = pm.NUTS( scaling=start )
        trace = pm.sample( n_samples, step=step, start=start )

    pm.summary( trace )
    pm.traceplot( trace )
    pm.plot_posterior( trace, kde_plot=True )
    plt.show()

if __name__ == '__main__':
    run()
Here are the estimates of the parameter $s$ in distribution form. Notice how compact the estimate is! With 95% probability the true value of the parameter $s$ lies in the range [1.439, 1.461]; the posterior mean is about 1.45, which is very close to the MLE estimates.
To provide some basic sampling diagnostics: the sampler was mixing well, as there is no visible structure in the trace:
To run the code, one needs Python with Theano and PyMC3 packages installed.
Thanks to @whuber for his great answer and comments!
Here is my attempt to fit the data, evaluate and explore the results using VGAM:
require("VGAM")
freq <- dzipf(1:100, N = 100, s = 1)*1000 #randomizing values
freq <- freq + abs(rnorm(n=1,m=0, sd=100)) #adding noize
zdata <- data.frame(y = rank(-freq, ties.method = "first") , ofreq = freq)
fit = vglm(y ~ 1, zipf, zdata, trace = TRUE,weight = ofreq,crit = "coef")
summary(fit)
s <- (shat <- Coef(fit)) # the coefficient we've found
probs <- dzipf(zdata$y, N = length(freq), s = s) # expected values
chisq.test(zdata$ofreq, p = probs)
plot(zdata$y,(zdata$ofreq),log="xy") #log log graph
lines(zdata$y, (probs)*sum(zdata$ofreq), col="red") # red line, num of predicted frequency
Chi-squared test for given probabilities
data: zdata$ofreq
X-squared = 99.756, df = 99, p-value = 0.4598
In our case the chi-square test's null hypothesis is that the data are distributed according to Zipf's law, so large p-values are consistent with that claim. Note that even very large p-values are not a proof, just an indicator.
Just for fun, this is another instance where the UWSE can provide a closed-form solution using only the topmost frequency, though at a cost in accuracy. The probability at $x = 1$ is unique across parameter values, so it determines $s$. If $\hat{w}_{x=1}$ denotes the corresponding relative frequency, then
$$\hat{s}_{UWSE} = H_{10}^{-1}\left(\frac{1}{\hat{w}_{x=1}}\right),$$
where $H_{10}^{-1}$ denotes the inverse of the generalized harmonic number $s \mapsto H_s(10)$ defined above. In this case, since $\hat{w}_{x=1} = 0.4695599775$, we get
$$\hat{s}_{UWSE} = 1.4.$$
Again, the UWSE only provides a consistent estimate, with no confidence intervals, and we can see some trade-off in accuracy. mpiktas' solution above is also an application of the UWSE, though programming is required. For a full explanation of the estimator, see https://paradsp.wordpress.com/ (all the way at the bottom).
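Computing $H_{10}^{-1}$ numerically is a quick root-finding exercise; for instance, in R (the helper H and the bracketing interval are just illustrative choices):
fr <- c(26486, 12053, 5052, 3033, 2536, 2391, 1444, 1220, 1152, 1039)
w1 <- fr[1] / sum(fr)                      # relative frequency at rank 1, about 0.46956
H <- function(s, n = 10) sum((1:n)^(-s))   # generalized harmonic number H_s(n)
uniroot(function(s) H(s) - 1 / w1, c(0.5, 5))$root  # about 1.4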
My solution tries to complement the answers provided by mpiktas and whuber with an implementation in Python. Our frequencies and ranks x are:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

freqs = np.asarray([26486, 12053, 5052, 3033, 2536, 2391, 1444, 1220, 1152, 1039])
x = np.asarray([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Since the power law is not normalized over the range of ranks we use, we need to normalize it each time we compute it: in the discrete case a simple approximation is to divide by the sum over all ranks. In this way we can compare different parameter values. We first plot the relative frequencies on log-log axes and then minimize the negative log likelihood:
f1 = freqs / freqs.sum()   # relative (normalized) frequencies

f, ax = plt.subplots()
ax.plot(x, f1, 'o')
ax.set_xscale("log")
ax.set_yscale("log")

def loglik(b):
    # Power law function
    Probabilities = x**(-b)
    # Normalized
    Probabilities = Probabilities / Probabilities.sum()
    # Log probabilities weighted by the observed frequencies
    Lvector = np.log(Probabilities) * freqs
    # The log likelihood is the sum
    L = Lvector.sum()
    # We want to maximize the log likelihood, i.e. minimize (-1)*log likelihood
    return -L

s_best = minimize(loglik, [2])
print(s_best)

# Overlay the fitted power law, anchored at the first relative frequency
ax.plot(x, f1[0] * x**(-s_best.x))
plt.show()
The result gives a slope of 1.450408, as in the previous answers.
Here is a simple example using the text Ulysses.
I'll use a simple bash script to acquire the type frequencies:
cat ulysses.txt | tr 'A-Z' 'a-z' | tr -dc 'a-z ' | tr ' ' '\n' | sort | uniq -c | sort -k1,1nr | awk '{print $1}' > ulysses_freq.txt
And then use R to fit the model using mle.
The normalized frequency of the element of rank k, $f(k;s,N)$, is defined as $$ f(k;s,N)=\frac{1/k^s}{\sum\limits_{n=1}^N (1/n^s)} = \frac{1/k^s}{H_{N,s}} $$ where $N$ is the vocabulary size and $s$ the exponent characterizing the distribution.
In a given sample, if we have $M$ tokens, the logarithm of the likelihood function will be given by: $$ \ln L(s|k_1,\cdots,k_M,N) = -M \ln H_{N,s} - s \sum_{m=1}^{M} \ln k_m . $$
The R code to load the data, fit the model, and plot the result is presented below:
library('stats4')
library('ggplot2')

freq <- as.numeric(readLines('ulysses_freq.txt'))

# negative log likelihood: M*log(H_{N,s}) + s*sum_k f_k*log(k)
zipf.ll <- function(s) {
    N <- length(freq)
    M <- sum(freq)
    idx <- 1:N
    return( M * log( sum((1:N)^(-s)) ) + s * sum(freq[idx] * log(idx)) )
}

m.fit <- mle(minuslogl = zipf.ll, start = list(s = 1), lower = c(0.5), upper = c(2),
             nobs = as.integer(sum(freq)))
summary(m.fit)

s <- coef(m.fit)
N <- length(freq)
f <- ((1:N)^(-s)) / (sum((1:N)^(-s)))

ggplot() +
    geom_point(aes(x = 1:N, y = freq/sum(freq))) +
    geom_line(aes(x = 1:N, y = f), colour = 'blue') +
    scale_x_continuous(trans = 'log10') +
    scale_y_continuous(trans = 'log10') +
    xlab('k') + ylab('f(k|s,N)') +
    theme_minimal()