5

Zipf's law states that in a text, a few words occur very often and many words occur hardly at all. Formally, Zipf's law for text sets $s = 1$ in the Zipf distribution defined by:

$$f(k; s, N) = \frac{k^{-s}}{\sum^N_{i=1}i^{-s}}$$

where $f(·)$ denotes the normalised frequency of a term, $k$ denotes the term’s frequency rank in our corpus (with $k = 1$ being the highest rank), $N$ is the number of terms in our vocabulary, and $s$ is a distribution parameter.
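As a quick sketch, the normalisation above can be computed directly; the values of $s$ and $N$ below are arbitrary illustration choices, not taken from any particular corpus:

```python
import numpy as np

def zipf_frequency(k, s, N):
    """Normalised frequency f(k; s, N) of the rank-k term under Zipf's law."""
    ranks = np.arange(1, N + 1)
    return k ** (-float(s)) / np.sum(ranks ** (-float(s)))
```

With $s = 1$ this reproduces the familiar statement of the law: the rank-1 term is predicted to occur twice as often as the rank-2 term, three times as often as the rank-3 term, and so on.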

Question: If my data doesn't completely follow Zipf's law (especially for the lowest- and highest-frequency words), how do I justify this mathematically?

Alexis
Slim Shady
  • To what does "it" refer? What are you trying to justify? – whuber Feb 13 '22 at 19:13
  • @whuber well my question specifically is: how can I justify (maybe somehow using the formula) why my data doesn't follow the line of the formula. (I know that the data should approximate the line of the formula, but why it's not exact) – Slim Shady Feb 14 '22 at 08:38
  • what is not exact? the fit? the fact that data should follow the law? knowing nothing about the process generating data, there is no way to mathematically prove anything about it. – carlo Feb 16 '22 at 17:35
  • @carlo Yes, the fit. Why in general the data doesn't follow the law (not a specific dataset). I'm not saying it doesn't follow the fit at all, but just maybe at the edges it doesn't follow it and in the middle it's not 100% following the fit. Something like the picture here: https://onlinelibrary.wiley.com/doi/full/10.1080/03640210802020003 – Slim Shady Feb 17 '22 at 07:28

1 Answer

9

The literature on the mathematical theory underlying Zipf's law is vast, and includes a large number of theoretical models in which the law emerges. Zipf's law is related to power laws through the fact that it asserts a power-law relationship between the rank and frequency of the objects under analysis, so there is also a substantial literature examining the connections between the Zipf distribution and power-law behaviour in the Pareto distribution. A good introductory exposition of this field is Mitzenmacher (2003). As you will see from that reference, a number of modelling approaches lead to the behaviour set out in Zipf's law.

For natural language and vocabulary analysis, the most prominent modelling approach is an information-theoretic derivation akin to the work of Mandelbrot (1953). That paper uses information optimisation to derive a slightly generalised form of Zipf's law; the model has had a large impact in information theory and has led to a range of later models. Mandelbrot's approach leads to a slightly generalised form of the Zipf distribution over the support $1,...,N$, defined by the proportionality relationship:

$$f(k|s,c,N) \propto \frac{1}{(k+c)^s} \cdot \mathbb{I}(k \in \{ 1,...,N \}),$$

with parameters $c \geqslant 0$ and $s > 0$. This distributional relationship is often exhibited on a log-log plot via the fact that the distribution satisfies:

$$\begin{align} \log f(k|s,c,N) &= \text{const} - s \log(k+c) \\[6pt] &= \text{const} - s \log(k) - s \log \Big( 1+\frac{c}{k} \Big). \\[6pt] \end{align}$$

In the special case where $c=0$ we see that the rank-frequency relationship will appear as a negative linear relationship on a log-log plot. For $c>0$ the relationship will appear nonlinear, but will become linear for $k \gg c$ (i.e., it will be close to linear except when $k$ is relatively low).
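A quick numerical check of this behaviour (with hypothetical parameter values $N = 10000$, $s = 1.2$, $c = 5$ chosen purely for illustration): the local log-log slope is far from $-s$ at the lowest ranks, but converges to $-s$ once $k \gg c$.

```python
import numpy as np

def zipf_mandelbrot_pmf(k, s, c, N):
    """f(k | s, c, N) for the Zipf-Mandelbrot distribution on ranks 1..N."""
    ranks = np.arange(1, N + 1)
    norm = np.sum((ranks + c) ** (-s))
    return (np.asarray(k, dtype=float) + c) ** (-s) / norm

# Hypothetical illustration values, not fitted to any data.
N, s, c = 10_000, 1.2, 5.0

# Local slope on the log-log plot across one rank doubling,
# at the low end (k = 1 -> 2) and well past c (k = 1000 -> 2000).
low = np.log(zipf_mandelbrot_pmf(np.array([1, 2]), s, c, N))
high = np.log(zipf_mandelbrot_pmf(np.array([1000, 2000]), s, c, N))
slope_low = (low[1] - low[0]) / np.log(2)     # far from -s at small ranks
slope_high = (high[1] - high[0]) / np.log(2)  # close to -s once k >> c
```

This is exactly the qualitative pattern often seen in word-frequency data: curvature at the top ranks, near-linearity in the middle of the log-log plot.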


A useful starting point for investigating Zipf's law in empirical data is to plot rank versus frequency on a log-log plot to see whether it roughly follows the above form. You can obtain the MLEs for the parameters $c$ and $s$ and superimpose the estimated Zipf distribution onto the log-log plot, to see how closely the data follow the best-fitting member of this family. You can also use goodness-of-fit tests to see whether the deviation of the data from the theoretical distribution is large enough to falsify the assumed distributional form.
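As a sketch of that fitting step, here is one way to estimate $s$ and $c$ by maximum likelihood. The data are simulated rank counts standing in for a real corpus (the "true" parameters are hypothetical), and a crude grid search stands in for a proper numerical optimiser:

```python
import numpy as np

def zipf_mandelbrot_pmf(s, c, N):
    """Probability mass over ranks 1..N for the Zipf-Mandelbrot distribution."""
    w = (np.arange(1, N + 1) + c) ** (-s)
    return w / w.sum()

def log_lik(s, c, counts):
    """Multinomial log-likelihood of the observed rank counts under (s, c)."""
    return float(np.sum(counts * np.log(zipf_mandelbrot_pmf(s, c, len(counts)))))

# Simulated data in place of real corpus counts (hypothetical parameters).
rng = np.random.default_rng(0)
N, true_s, true_c = 200, 1.1, 2.0
counts = rng.multinomial(50_000, zipf_mandelbrot_pmf(true_s, true_c, N))

# Crude MLE by grid search; a real analysis would use a numerical optimiser.
s_grid = np.linspace(0.5, 2.0, 61)
c_grid = np.linspace(0.0, 5.0, 51)
s_hat, c_hat = max(((s, c) for s in s_grid for c in c_grid),
                   key=lambda p: log_lik(p[0], p[1], counts))
```

The fitted pmf `zipf_mandelbrot_pmf(s_hat, c_hat, N)` can then be superimposed on the empirical log-log plot, and compared with the observed counts via a goodness-of-fit test such as a chi-squared test.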

Now, if your data do depart from Zipf's law (in the generalised sense shown here), that means you will need to investigate broader distributional forms that accommodate your data. It is a bad idea to try to "justify" Zipf's law against the evidence provided by your data: you should allow your data to lead the analysis and seek models and distributional forms that are plausible when compared with your data. If your data do not fit the family of Zipf distributions, ideally you would broaden the analysis by examining distributional families that arise from some plausible simple change to the underlying information-theoretic models. Ideally you will end up with a distributional form that has a solid information-theoretic foundation and also fits your data well.

Ben