3

I have a dataset with hashtags and their frequencies (~370k frequencies), as for example (after a sort):

373827 hashtag_1
373826 hashtag_2
373826 hashtag_3
373826 hashtag_4
373825 hashtag_5
373823 hashtag_6
373823 hashtag_7
373822 hashtag_8

and I want to check if this frequencies follows Zipf-Mandelbrot's law (link) through fitting in R. So I must estimate the two parameters of the law (the exponent s and q following Wikipedia page). For this purpose I have decided to use function zm.ll of tolerance package that uses ML estimation, obtaining s~1,65 and q~0.085.

Now I'd understand if this estimation is correct through Chi-square test but I didn't understand how to do this test in R and the interpretation of the p-value returned. I've tried with this that return p.val=0:

# ZipfMan law where y is the frequencies
x           <- (((1:length(y))+0.085)^1.65)
p           <- x/sum(x)
tab         <- tabulate(y, length(y))
discrepancy <- (tab-length(y)*p)^2/(length(y)*p)
chi.stat    <- sum(discrepancy)
p.val       <- pchisq(chi.stat, df= length(y)-2-1, lower.tail = FALSE)
c(p.val, chi.stat)

I have some doubts about this piece of code (also about the choice of degree of freedom) and the result returned by it and I have difficulties to interpret p-value.

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
Daniele
  • 41
  • 5

1 Answers1

1

I've resolved using max likelihood estimation for fitting parameters and the probability mass function of Zipf-Mandelbrot to compute probabilities and expected frequencies. Recall the problem: i'd check if the occurrences of hashtags (and not the frequencies of the occurrences) follow a Zipfian distribution (Zipf or ZM).

# use lzipfman for plotting
lzipfman <- function(alpha, beta) -s*log((1:length(data$V2)) + beta)-log(sum(1/((1:length(data$V2))+beta)^s))

##### MLE estimation #####
zm.ll <- function(alpha, beta) {
  sum(freq*(alpha*log(1:length(data$V2)+beta)+log(sum(1/(1:length(data$V2)+beta)^alpha))))
}

zm.fit <- mle(minuslogl = zm.ll, start = list(alpha=1, beta=0.1))
summary(zm.fit)

##### Probabilities and test #####
probs =(1/(1:length(data$V2)+beta)^alpha)/sum(1/(1:length(data$V2)+beta)^alpha)

chisq.test(data$V2, p = probs)
Daniele
  • 41
  • 5