Questions tagged [zipf]

36 questions
27
votes
7 answers

How to calculate Zipf's law coefficient from a set of top frequencies?

I have several query frequencies, and I need to estimate the coefficient of Zipf's law. These are the top frequencies: 26486 12053 5052 3033 2536 2391 1444 1220 1152 1039
Diegolo
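
A common first pass at this kind of estimate is an ordinary least-squares fit of log frequency against log rank, where the negated slope approximates the Zipf exponent. The sketch below uses the ten frequencies quoted in the question and assumes they correspond to ranks 1 through 10. The function name `zipf_exponent_loglog` is hypothetical, and a log-log regression is only a rough estimator, not necessarily the method the answers recommend.

```python
import math

def zipf_exponent_loglog(freqs):
    """Estimate the Zipf exponent s by least squares on log(rank) vs log(frequency).

    Assumes freqs[0] is the frequency at rank 1, freqs[1] at rank 2, and so on.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Under f(rank) ~ C * rank**(-s), the slope of the log-log line is -s.
    return -sxy / sxx

# Frequencies quoted in the question, assumed to be ranks 1..10.
top_freqs = [26486, 12053, 5052, 3033, 2536, 2391, 1444, 1220, 1152, 1039]
s_hat = zipf_exponent_loglog(top_freqs)
print(f"estimated exponent: {s_hat:.2f}")
```

With only the top ten ranks the fit is sensitive to the largest values, so the result is best treated as a starting point for a likelihood-based estimate.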
12
votes
3 answers

How to estimate parameters for Zipf truncated distribution from a data sample?

I have a problem estimating the parameter of a Zipf distribution. My situation is the following: I have a sample (measured from an experiment that generates calls that should follow a Zipf distribution). I have to demonstrate that this generator really…
Maurizio
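
For a Zipf distribution truncated to the support 1..N, one standard approach is maximum likelihood: the log-likelihood is concave in the exponent s, so a one-dimensional search is enough. The following is a minimal sketch under stated assumptions, not the asker's method: `truncated_zipf_mle` is a hypothetical name, N is taken to be the length of the count vector, and the search interval [0, 5] is an arbitrary choice.

```python
import math

def truncated_zipf_mle(counts, lo=0.0, hi=5.0, iters=100):
    """Maximum-likelihood exponent for a Zipf law truncated to 1..N.

    counts[k-1] is the number of observations equal to k, with N = len(counts).
    The log-likelihood is concave in s, so ternary search finds its maximum.
    """
    n_support = len(counts)
    total = sum(counts)
    weighted_log = sum(c * math.log(k) for k, c in enumerate(counts, start=1))

    def loglik(s):
        norm = sum(k ** -s for k in range(1, n_support + 1))
        return -s * weighted_log - total * math.log(norm)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Synthetic check: counts proportional to k**-1.2 on 1..50 (made-up data).
weights = [k ** -1.2 for k in range(1, 51)]
synthetic_counts = [round(1_000_000 * w / sum(weights)) for w in weights]
s_mle = truncated_zipf_mle(synthetic_counts)
print(f"recovered exponent: {s_mle:.3f}")
```

Estimating s this way does not by itself demonstrate that the generator is Zipfian; a goodness-of-fit test on the fitted model would still be needed.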
10
votes
2 answers

Is the KS test really appropriate when validating a power law or estimating power-law parameters?

I'm attempting to find out whether some highly skewed data are drawn from a power law distribution, following the popular paper by Clauset, Shalizi and Newman, 2009. Clauset et al. use the Kolmogorov-Smirnov (KS) statistic to measure the…
7
votes
1 answer

Connection between power law and Zipf's law

I am trying to better understand the connection between the power law distribution and Zipf's distribution (law). There is a neat explanation in [1]. The article suggests that since we can derive the power law function from Pareto's law, combined…
fsociety
7
votes
3 answers

Is principal components analysis valid if the distribution(s) are Zipf-like? What would be similar to PCA but suited to non-Gaussian data?

I'm analyzing people based on their Twitter stream. We are using a 'word bag' model of users, which basically amounts to counting how often each word appears in a person's Twitter stream (and then using that as a proxy for a more normalized…
utunga
7
votes
2 answers

How to verify if data follows Zipf's law without looking at the graph

I want to check whether a given text sample was written by real people or not, so I think Zipf's law could help. If the data follow a Zipfian distribution, the most frequent word will occur approximately twice as often as the second most frequent word, three…
anvoz
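
The rule of thumb in this excerpt (the top word occurring about twice as often as the second) can be checked numerically by comparing the ratio of the top frequency to the frequency at each rank with the rank itself. This is a rough sketch assuming an exponent of 1 and frequencies already sorted in descending order. The helper names and the sample counts are made up for illustration, and a formal goodness-of-fit test would be a stronger check.

```python
def rank_ratio_table(freqs):
    """Return (rank, f_1 / f_rank) pairs for frequencies sorted in descending order.

    Under an ideal Zipf law with exponent 1, each ratio equals its rank.
    """
    top = freqs[0]
    return [(rank, top / f) for rank, f in enumerate(freqs, start=1)]

def max_relative_deviation(freqs):
    """Largest relative gap between the observed ratio and the ideal ratio (the rank)."""
    return max(abs(ratio - rank) / rank for rank, ratio in rank_ratio_table(freqs))

# Made-up word counts that follow the ideal pattern exactly.
ideal_counts = [1200, 600, 400, 300]
for rank, ratio in rank_ratio_table(ideal_counts):
    print(rank, ratio)
print("max relative deviation:", max_relative_deviation(ideal_counts))
```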
5
votes
1 answer

If my data doesn't completely follow Zipf's law, how do I justify it mathematically?

Zipf's law states that in a text a few words occur very often, and many words hardly ever occur. Zipf's law for text sets $s = 1$ in the Zipf distribution defined by: $$f(k; s, N) = \frac{k^{-s}}{\sum_{i=1}^{N} i^{-s}}$$ where $f(\cdot)$ denotes…
Slim Shady
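
The probability mass function quoted in this excerpt translates directly into code. The sketch below evaluates $f(k; s, N)$ as written, with the hypothetical function name `zipf_pmf`. It recomputes the normalizing sum on every call, which is acceptable for illustration but wasteful for large $N$.

```python
def zipf_pmf(k, s, n):
    """Evaluate f(k; s, N) = k**-s / sum_{i=1..N} i**-s for rank k in 1..N."""
    if not 1 <= k <= n:
        raise ValueError("k must lie in 1..n")
    norm = sum(i ** -s for i in range(1, n + 1))
    return k ** -s / norm

# With s = 1 the rank-1 probability is twice the rank-2 probability.
n_words = 100
probs = [zipf_pmf(k, 1.0, n_words) for k in range(1, n_words + 1)]
print(sum(probs), probs[0] / probs[1])
```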
5
votes
0 answers

Zipf vs. self-similar: are they really the same?

Recently I ran into a test using both Zipf and self-similar generated datasets. I followed the description from Jim Gray's paper on generating such datasets (Quickly Generating Billion-Record Synthetic Databases). In that paper it mentions: It is…
asksw0rder
5
votes
1 answer

Besides the Pareto and Zipfian distributions, which distributions obey a power law?

I need a list of distributions that obey a power law, besides the commonly used Pareto and Zipfian distributions. A comprehensive list or a reference to a comprehensive list will be particularly appreciated.
4
votes
2 answers

Characterizing/Fitting Word Count Data into Zipf / Power Law / LogNormal

Using NLTK and Pandas, I was able to process some text files and generate word count data for them, and finally create a histogram describing word frequency. However, I'm wondering what kind of analysis I should do in order to characterize this…
born to hula
4
votes
1 answer

How to determine if Zipf's Law can be applied?

I'm trying to use feature hashing (or hashing trick) on a set of files that are composed mostly in English and code (in English). I'm trying to see if Zipf's law applies to the set of files before I try to use feature hashing. Looking at 3 different…
user1157751
3
votes
1 answer

Discrete Pareto Distribution vs Zipf Distribution and Power Law vs Zipf Law

I need to get a simple but clear idea of Discrete Pareto Distribution vs Zipf Distribution and Power Law vs Zipf Law. (Are they similar? How do they relate to each other?) Wikipedia definitions do not address my issue. If graphical explanation is…
3
votes
1 answer

Fitting Zipf-Mandelbrot and using a chi-square test in R

I have a dataset with hashtags and their frequencies (~370k frequencies), as for example (after a sort): 373827 hashtag_1 373826 hashtag_2 373826 hashtag_3 373826 hashtag_4 373825 hashtag_5 373823 hashtag_6 373823 hashtag_7 373822 hashtag_8 and I…
Daniele
3
votes
2 answers

How to assess the optimal bag-of-words vector size?

I have a corpus with 6040592 words and 309074 types (different words). Given this information, is it possible to determine the optimal size of bag-of-words vectors in order to represent phrases? I am using a data structure like this: {'contains(The)':…
3
votes
0 answers

Mean and median in Zipf's distribution

I have a collection of $10^5$ essays, each of which on average contains $10^3$ distinct words. There are $10^6$ distinct words in the entire collection. If I index every word what is the mean and median size of the inverted index lists? My guess is…
matcheek