Starting from Poisson distribution and cluster analysis

I am trying to find a statistical/empirical method to test whether my Index of Dispersion (https://en.wikipedia.org/wiki/Index_of_dispersion) is significant or not. I am not a statistician, so please forgive any mistakes in my calculations.

Here are my data:

df = read.table(text = 'Year Count
  1975   10
  1976   12
  1977    9
  1978   14
  1979   14
  1980   11
  1981    8
  1982    7
  1983   10
  1984    8
  1985   12
  1986    9
  1987   10
  1988    9
  1989   10
  1990    9
  1991   11
  1992   12
  1993    9
  1994   10
  1995    8
  1996   12
  1997   11
  1998   13
  1999    7
  2000   13
  2001   10
  2002    9
  2003    8
  2004   13
  2005   15
  2006   11
  2007   10
  2008   11
  2009    9
  2010   10
  2011    8
  2012   11
  2013   10
  2014    6', header = TRUE)

My Index of Dispersion phi is then:

phi = var(df$Count) / mean(df$Count)

> print(phi)
[1] 0.4137045

So my data show underdispersion, because 0 < phi < 1.

How can I test whether this value of phi is significant?

I couldn't find a specific test, so I tried a simulation with 10,000 random vectors drawn from a uniform distribution (i.e. each year has the same probability of receiving an observation).

Here is my 'test' code:

#create a list of 10,000 simulated vectors of years
#409 is the total number of observations for each vector and is equal to sum(df$Count)
sim_list = lapply(1:10000, function(i) data.frame(Year = round(runif(409, 1975, 2014))))

#count how many observations per year
sim_tbl = lapply(sim_list, function(d) data.frame(table(d$Year)))

#calculate the index of dispersion for each simulated vector
sim_phi_list = lapply(sim_tbl, function(d) var(d$Freq) / mean(d$Freq))

#unlist in order to have all the indexes in one vector
sim_phi = unlist(sim_phi_list)

#hist of the simulated indexes
hist(sim_phi)

#print mean, standard deviation and variance
> mean(sim_phi)
[1] 1.129969
> sd(sim_phi)
[1] 0.2422462
> var(sim_phi)
[1] 0.05868323
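
As an extra check, I thought I could also compare my observed phi directly with the simulated distribution, as a rough one-sided 'empirical p-value'. I am not sure this is the right way to use the simulation, so take it only as a sketch:

#proportion of simulated indexes that are as small as (or smaller than) the observed phi
#(a rough one-sided empirical p-value under my uniform null)
mean(sim_phi <= phi)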

Can I say that my phi = 0.4137045 is not significant, given that the simulation produced mean(sim_phi) = 1.129969, i.e. overdispersion?

no_one
    Significant in what sense? – mdewey Oct 14 '16 at 14:39
  • p-values. significant in the sense of checking whether my VMR index came out by chance or not. – no_one Oct 14 '16 at 14:45
  • This question seems perhaps related to [this recent question](http://stats.stackexchange.com/questions/240039/poisson-distribution-and-cluster-analysis). If this is the same user (?), then you should mention the context by linking to that question (and [merge accounts](http://stats.stackexchange.com/help/merging-accounts)). – GeoMatt22 Oct 18 '16 at 19:57
  • Based on the results of the simulation, wouldn't a test of whether or not the simulated mean is significantly different from zero answer your question? – Mike Hunter Oct 18 '16 at 20:19

1 Answer


If you look at your first link, Index of Dispersion, under the "Statistics" section, it says (see Wikipedia link for definitions)

If the variates are Poisson distributed then the index of dispersion is distributed as a $\chi^2$ statistic with $n-1$ degrees of freedom when $n$ is large and $\mu>3$. For many cases of interest this approximation is accurate and Fisher in 1950 derived an exact test for it.

and gives a citation for the relevant paper

Frome, E. L. (1982). Algorithm AS 171: Fisher's Exact Variance Test for the Poisson Distribution. Journal of the Royal Statistical Society. Series C (Applied Statistics), 31(1), 67-71.

This would seem to be the standard significance test relevant to your case, under the null hypothesis that your data are i.i.d. Poisson.
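
If no ready-made implementation of the exact test turns up, one option is to approximate it by Monte Carlo: as far as I understand, the exact test works with the distribution of the dispersion statistic conditional on the observed total, and for i.i.d. Poisson counts that conditional distribution is multinomial with equal cell probabilities. A sketch in R (my own naming, not the Frome algorithm itself):

#Monte Carlo approximation of the conditional null distribution:
#given the total count, i.i.d. Poisson counts are multinomial with equal probabilities
set.seed(1)
n_years = nrow(df)
total   = sum(df$Count)
sim     = rmultinom(10000, size = total, prob = rep(1 / n_years, n_years))

#index of dispersion for each simulated replicate and for the observed data
sim_phi_mc = apply(sim, 2, function(x) var(x) / mean(x))
obs_phi    = var(df$Count) / mean(df$Count)

#one-sided Monte Carlo p-value for underdispersion
mean(sim_phi_mc <= obs_phi)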


In response to the comment: I do not use R, so I cannot say for sure (though this may be it).

However, your data have $n=40$ and $\bar{x}=10.225\gg 3$, so the chi-squared approximation should be reasonable. Given your sample variance of $s^2=4.23$, then by the formula cited in the reference above, the statistic $$t=\frac{s^2(n-1)}{\bar{x}}\approx 16.13$$ should follow a chi-squared distribution with $n-1=39$ degrees of freedom under the null hypothesis.

From the chi-squared CDF, under the null hypothesis you would then have $$\Pr[t\leq 16.13]\approx 4.5\times 10^{-4}$$ So you can reject the null hypothesis with a very high degree of certainty (your p-value is more than two orders of magnitude smaller than the "default" value of $5\times 10^{-2}$ given in the tag wiki). As noted in my answer to the other question, this does not seem like a borderline case at all.
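
I do not use R myself, so treat the following only as a sketch of how the above approximation might be computed there, assuming the data frame `df` from the question (`pchisq` is the chi-squared CDF in base R):

#chi-squared approximation to the dispersion test
n    = nrow(df)              #number of years (40)
xbar = mean(df$Count)        #sample mean
s2   = var(df$Count)         #sample variance
stat = s2 * (n - 1) / xbar   #test statistic, ~ chi-squared with n-1 df under the null

#one-sided p-value for underdispersion (lower tail of the chi-squared CDF)
pchisq(stat, df = n - 1, lower.tail = TRUE)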

GeoMatt22
  • thanks! any chance you know whether there is an R package to perform this test? I searched a lot but didn't find anything – no_one Oct 18 '16 at 22:18
  • @no_one I updated the answer. To all, I *think* my significance-test example is correct (numbers and *terminology*), but as I am not much of a statistician, corrections welcomed! – GeoMatt22 Oct 18 '16 at 23:14
  • thanks @GeoMatt22, you found what I was looking for. However, regarding the test formula, I think it is not correct. If you have a look at http://link.springer.com/article/10.1023/A:1015126322223, equation (1) seems to be the one proposed by Fisher (1950). – no_one Oct 19 '16 at 08:43
  • no, sorry. please ignore my comment :) – no_one Oct 19 '16 at 08:47
  • by the way, I don't understand how you created the formula (from the reference) and how you calculated the p-value. sorry.. – no_one Oct 19 '16 at 09:30
  • I took the formula from first 3 sentences of ref, [here](http://www.jstor.org/stable/2347079). The CDF formula is on the top right "PDF facts" box on the Wikipedia page (I used `gammainc()` in Matlab). – GeoMatt22 Oct 19 '16 at 11:58