Questions tagged [zipf]
36 questions
27
votes
7 answers
How to calculate Zipf's law coefficient from a set of top frequencies?
I have several query frequencies, and I need to estimate the coefficient of Zipf's law. These are the top frequencies:
26486
12053
5052
3033
2536
2391
1444
1220
1152
1039
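One rough way to get a coefficient from these ten values is an ordinary least-squares fit of log frequency against log rank. The sketch below assumes the values listed are ranks 1 through 10 of the full ranking and that the model is $f(r) \approx C / r^{s}$; the function name `zipf_exponent_ols` is hypothetical. A log-log regression on so few points is only a quick estimate, and a maximum-likelihood fit may be preferable when the full sample is available.

```python
import math


def zipf_exponent_ols(frequencies):
    """Estimate s in f(r) ~ C / r**s by least squares on log-log axes.

    `frequencies` must be sorted in descending order; position i (0-based)
    is treated as rank i + 1.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -s, so flip the sign to report a positive exponent.
    return -sxy / sxx


# The top frequencies quoted in the question, in rank order.
top_frequencies = [26486, 12053, 5052, 3033, 2536, 2391, 1444, 1220, 1152, 1039]
s = zipf_exponent_ols(top_frequencies)
print(f"estimated exponent: {s:.2f}")
```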

Diegolo
- 289
- 1
- 3
- 4
12
votes
3 answers
How to estimate parameters for a truncated Zipf distribution from a data sample?
I have a problem with parameter estimation for a Zipf distribution. My situation is the following:
I have a sample set (measured from an experiment that generates calls that should follow a Zipf distribution). I have to demonstrate that this generator really…

Maurizio
- 265
- 2
- 9
10
votes
2 answers
Is KS test really appropriate when validating a power law/estimating power law parameters?
I'm attempting to find out whether some highly skewed data are drawn from a power law distribution, following the popular paper by Clauset, Shalizi and Newman, 2009.
Clauset et al. use the Kolmogorov-Smirnov (KS) statistic to measure the…

JaydenM-C
- 103
- 4
7
votes
1 answer
Connection between power law and Zipf's law
I am trying to better understand the connection between the power law distribution and Zipf's distribution (law). There is a neat explanation in [1].
The article suggests that as we can derive the power law function from Pareto's law, combined…

fsociety
- 1,084
- 1
- 12
- 25
7
votes
3 answers
Is principal components analysis valid if the distribution(s) are Zipf-like? What would be similar to PCA but suited to non-Gaussian data?
I'm analyzing people based on their Twitter stream. We are using a 'word bag' model of users, which basically amounts to counting how often each word appears in a person's Twitter stream (and then using that as a proxy for a more normalized…

utunga
- 173
- 1
- 4
7
votes
2 answers
How to verify if data follows Zipf's law without looking at the graph
I want to check whether a given text sample was written by real people or not, so I think Zipf's law could help.
If the data follow a Zipfian distribution, the most frequent word will occur approximately twice as often as the second most frequent word, three…
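The "twice as often, three times as often" pattern is equivalent to saying that rank times frequency stays roughly constant. A minimal sketch of that numeric check, assuming whitespace tokenization and an exponent of 1 (the function name `rank_frequency_products` and the toy text are hypothetical):

```python
from collections import Counter


def rank_frequency_products(text):
    """Return rank * frequency for each word, ordered by descending frequency.

    Under an ideal Zipf law with exponent 1 these products are all roughly
    equal; a strong drift in the values suggests a poor fit.
    """
    counts = sorted(Counter(text.lower().split()).values(), reverse=True)
    return [rank * freq for rank, freq in enumerate(counts, start=1)]


# Toy text built so that the frequencies are exactly 12, 6, 4 and 3.
toy_text = " ".join(["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3)
print(rank_frequency_products(toy_text))
```

On real text the products will fluctuate, so some tolerance (for example, on their relative spread) has to be chosen by the analyst.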

anvoz
- 171
- 1
- 4
5
votes
1 answer
If my data doesn't completely follow the Zipf's law, how do I justify it mathematically?
Zipf's law states that in a text, a few words occur very often and many words hardly ever occur. For text, Zipf's law sets $s = 1$ in the Zipf distribution defined by:
$$f(k; s, N) = \frac{k^{-s}}{\sum^N_{i=1}i^{-s}}$$
where $f(·)$ denotes…
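The formula above can be evaluated directly. This is a minimal sketch, assuming $k$ is the rank, $s$ the exponent, and $N$ the number of ranked items; the function name `zipf_pmf` is hypothetical.

```python
def zipf_pmf(k, s, n):
    """Probability of rank k under f(k; s, N) = k**-s / sum_{i=1..N} i**-s."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Denominator: the generalized harmonic number H_{N,s}.
    normaliser = sum(i ** -s for i in range(1, n + 1))
    return k ** -s / normaliser


# With s = 1 the probabilities fall off as 1/k and sum to 1 over all ranks.
probabilities = [zipf_pmf(k, 1, 5) for k in range(1, 6)]
print(probabilities)
```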

Slim Shady
- 203
- 9
5
votes
0 answers
Zipf's vs Self-similar: are they really the same
Recently I ran into a test using both Zipf and self-similar generated datasets. I followed the description from Jim Gray's paper on generating such datasets (Quickly Generating Billion-Record Synthetic Databases). That paper mentions:
It is…

asksw0rder
- 163
- 4
5
votes
1 answer
Besides the Pareto and Zipfian distributions, which distributions obey the power-law?
I need a list of distributions that obey the power law, besides the commonly used Pareto and Zipfian distributions. A comprehensive list or a reference to a comprehensive list will be particularly appreciated.

PatternRecognition
- 623
- 5
- 19
4
votes
2 answers
Characterizing/Fitting Word Count Data into Zipf / Power Law / LogNormal
Using NLTK and Pandas, I was able to process some text files and generate word count data for them, and finally create a histogram describing word frequency.
However, I'm wondering what kind of analysis I should do in order to characterize this…

born to hula
- 101
- 6
4
votes
1 answer
How to determine if Zipf's Law can be applied?
I'm trying to use feature hashing (or hashing trick) on a set of files that are composed mostly in English and code (in English).
I'm trying to see if Zipf's law applies to the set of files before I try to use feature hashing.
Looking at 3 different…

user1157751
- 517
- 1
- 6
- 17
3
votes
1 answer
Discrete Pareto Distribution vs Zipf Distribution and Power Law vs Zipf Law
I need a simple but clear idea of the discrete Pareto distribution vs the Zipf distribution, and of the power law vs Zipf's law. (Are they similar? How do they relate to each other?) Wikipedia definitions do not address my issue. If graphical explanation is…

Dovini Jayasinghe
- 269
- 1
- 13
3
votes
1 answer
Fitting Zipf-Mandelbrot and using a chi-square test in R
I have a dataset with hashtags and their frequencies (~370k frequencies), for example (after sorting):
373827 hashtag_1
373826 hashtag_2
373826 hashtag_3
373826 hashtag_4
373825 hashtag_5
373823 hashtag_6
373823 hashtag_7
373822 hashtag_8
and I…

Daniele
- 41
- 5
3
votes
2 answers
How to assess the optimal bag-of-words vector size?
I have a corpus with 6040592 words and 309074 types (different words). Given this information, is it possible to determine the optimal size of bag-of-words vectors for representing phrases?
I am using a data structure like this:
{'contains(The)':…

alemol
- 131
- 2
3
votes
0 answers
Mean and median in Zipf's distribution
I have a collection of $10^5$ essays, each of which on average contains $10^3$ distinct words. There are $10^6$ distinct words in the entire collection. If I index every word, what are the mean and median sizes of the inverted index lists?
My guess is…

matcheek
- 375
- 3
- 12