I have a dataset of hashtags and their frequencies (~370k frequencies), for example (after sorting):
373827 hashtag_1
373826 hashtag_2
373826 hashtag_3
373826 hashtag_4
373825 hashtag_5
373823 hashtag_6
373823 hashtag_7
373822 hashtag_8
and I want to check whether these frequencies follow the Zipf-Mandelbrot law by fitting it in R. So I need to estimate the two parameters of the law (the exponent s and q, following the Wikipedia page). For this purpose I decided to use the function zm.ll from the tolerance package, which uses ML estimation, and obtained s ≈ 1.65 and q ≈ 0.085.
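For reference, here is a minimal sketch of the Zipf-Mandelbrot probabilities in the Wikipedia parametrisation, where the probability of rank k is proportional to (k + q)^(-s). The helper name zm_pmf and the choice N = 10 are assumptions made only for illustration; they do not come from the tolerance package.

```r
# Zipf-Mandelbrot probabilities for ranks 1..N (Wikipedia parametrisation):
# f(k; N, q, s) = (k + q)^(-s) / sum_{i=1..N} (i + q)^(-s)
# zm_pmf is a hypothetical helper, not a function of the tolerance package.
zm_pmf <- function(N, s, q) {
  w <- (seq_len(N) + q)^(-s)  # unnormalised weights, note the negative exponent
  w / sum(w)                  # normalise so the probabilities sum to 1
}

# Small illustrative example with the estimates quoted above
p <- zm_pmf(N = 10, s = 1.65, q = 0.085)
sum(p)            # total probability
all(diff(p) < 0)  # probabilities decrease with rank
```

The negative exponent is what makes the probabilities decrease with rank, which is the shape the sorted frequencies are expected to follow under this law.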
Now I'd like to check whether this estimate is correct using a chi-square test, but I don't understand how to run this test in R or how to interpret the p-value it returns. I tried the following, which returns p.val = 0:
# Zipf-Mandelbrot law, where y is the vector of frequencies
x <- (((1:length(y))+0.085)^1.65)
p <- x/sum(x)
tab <- tabulate(y, length(y))
discrepancy <- (tab-length(y)*p)^2/(length(y)*p)
chi.stat <- sum(discrepancy)
p.val <- pchisq(chi.stat, df= length(y)-2-1, lower.tail = FALSE)
c(p.val, chi.stat)
I have some doubts about this piece of code (including the choice of degrees of freedom) and about the result it returns, and I have difficulty interpreting the p-value.
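One possible reading of the chi-square goodness-of-fit test described above is sketched below, under stated assumptions. It treats the sorted frequencies themselves as the observed counts per rank (rather than tabulating them), takes the expected counts as the total count times the Zipf-Mandelbrot probabilities with a negative exponent, and uses synthetic data in place of the real hashtag counts. The helper zm_pmf, the value N = 50 and the sample size are hypothetical choices for illustration only.

```r
# Hypothetical helper: Zipf-Mandelbrot probabilities for ranks 1..N
zm_pmf <- function(N, s, q) {
  w <- (seq_len(N) + q)^(-s)
  w / sum(w)
}

# Synthetic example data (an assumption, NOT the real hashtag counts):
# draw 1e5 observations over N ranks, then sort as in the question
set.seed(1)
N <- 50
p <- zm_pmf(N, s = 1.65, q = 0.085)
y <- sort(as.vector(rmultinom(1, size = 1e5, prob = p)), decreasing = TRUE)

# Observed counts are the frequencies themselves, ordered by rank;
# expected counts are the total count times the model probabilities
expected <- sum(y) * p
chi.stat <- sum((y - expected)^2 / expected)

# Usual rule when s and q are estimated from the same data:
# df = number of cells - 1 - number of estimated parameters
df <- N - 1 - 2
p.val <- pchisq(chi.stat, df = df, lower.tail = FALSE)
c(p.val = p.val, chi.stat = chi.stat)
```

A small p-value would mean the observed counts deviate from the fitted law by more than sampling noise would explain. With very large total counts the test can detect even tiny deviations, so a p-value near 0 on real data does not by itself say how large the misfit is.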