
Is there an equivalent of the two-sample Kolmogorov-Smirnov test for integer data (not count data, as it can include negative integers)?

The Kolmogorov-Smirnov test does not perform well in the presence of lots of ties, which are obviously common with integers.

fmark
  • A Google search turns up a lot of useful literature: https://www.google.com/search?q=distribution+test+discrete. Note that most tests that work with count data will work, *mutatis mutandis,* with any discrete distribution. – whuber Sep 07 '12 at 14:45
  • 1
    This question might also be of interest: [Is there an alternative to the Kolmogorov-Smirnov test for tied data with correction?](http://stats.stackexchange.com/q/35606/10525). –  Sep 08 '12 at 14:49

2 Answers


A permutation test can be applied here. The idea is as follows.

Let $X_1,...,X_m\sim F$ and $Y_1,...,Y_n\sim G$ be two independent samples and consider testing the hypothesis $H_0:F=G$ vs. $H_1:F\neq G$. For this purpose, label your data as follows

\begin{array}{c c} 1 & X_1\\ 1 & X_2\\ \vdots & \vdots\\ 1 & X_m\\ 2 & Y_1\\ 2 & Y_2\\ \vdots & \vdots\\ 2 & Y_n\\ \end{array}

Now, let $T$ be a statistic computed from the sample $S=\{X_1,...,X_m,Y_1,...,Y_n\}$ and the labels $L=\{1,1,...,1,2,2,...,2\}$.

If $H_0$ is true, the labels are exchangeable, so the labeling is superfluous.

Now, permute the group labels and recalculate the test statistic a large number of times, say $B$.

The one-sided p-value of this test is the proportion of permuted statistics greater than or equal to the observed value $T(S,L)$. The two-sided p-value is the proportion of permuted statistics whose absolute value is greater than or equal to $\mbox{abs}(T(S,L))$.

A toy example

Let $X_i \sim \text{Poisson}(10)$, $i=1,...,m=100$, and $Y_j \sim \text{Poisson}(11)$, $j=1,...,n=100$. Consider the statistic $T=\text{mean of Group 1} - \text{mean of Group 2}$. The permutation method using this statistic is implemented below.

rm(list = ls())
set.seed(1)
# Sample size
ns = 100
# Simulated data: X ~ Poisson(10), Y ~ Poisson(11)
x = rpois(ns, 10)
y = rpois(ns, 11)

# Observed statistic    
T.obs = mean(x) - mean(y)

# Pooled data
SL = rbind(cbind(rep(1,ns),x),cbind(rep(2,ns),y))

# Resampling
B = 10000
T.perm = rep(0, B)

for (i in 1:B) {
  samp = sample(SL[, 1])   # permute the group labels
  ind1 = which(samp == 1)
  ind2 = which(samp == 2)
  T.perm[i] = mean(SL[ind1, 2]) - mean(SL[ind2, 2])
}

# Two-sided p-value: proportion of permuted statistics at least as extreme
p.value = mean(abs(T.perm) >= abs(T.obs))

I do not know how robust this method is, but after some experiments it seems to perform moderately well. Note that the choice of the statistic $T$ is open, so one must be careful to make a meaningful choice in the context of the problem, since the performance depends on both the statistic and the sample size.
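For instance, since the choice of $T$ is open, the same permutation scheme can be run with a KS-type statistic (the maximum distance between the two empirical CDFs), which is sensitive to general distributional differences rather than only a shift in means. A minimal sketch (the helper name `ks.stat` and the Poisson toy data are illustrative choices, not part of the method above):

```r
set.seed(1)
x <- rpois(100, 10)
y <- rpois(100, 11)

# Maximum distance between the two empirical CDFs, evaluated on the
# pooled support, so ties are handled without any special correction
ks.stat <- function(a, b) {
  pts <- sort(unique(c(a, b)))
  max(abs(ecdf(a)(pts) - ecdf(b)(pts)))
}

T.obs  <- ks.stat(x, y)
pooled <- c(x, y)
B <- 2000
T.perm <- replicate(B, {
  idx <- sample(length(pooled), length(x))  # random relabeling
  ks.stat(pooled[idx], pooled[-idx])
})

# One-sided p-value: large KS values indicate a distributional difference
p.value <- mean(T.perm >= T.obs)
```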

I hope this helps.

  • 2
    +1 The usual chi-squared statistic works well. I examined the distributions of bootstrapped p-values for zero-mean *shifted* Poisson distributions ($X + \lfloor\lambda\rfloor \sim \text{Poisson}(\lambda)$) and found good power even with moderately small sample sizes. E.g., with two datasets of $100$ values each, $\lambda=1$ is discriminated from $\lambda=1.4$ with 50% power at $\alpha=.05$. These chi-squared statistics do not appear to have chi-squared distributions, whence the need to bootstrap the p-values. – whuber Sep 07 '12 at 15:56

I would suggest the two-sample chi-square test, where you bin the data and compare the binned totals with the "expected" counts that would fall within each bin based on the pooled sample. This generalizes to k samples with k greater than 2. I am assuming that you are not requiring a test of the empirical-CDF form; I think that entire class of tests could have trouble when there are many ties.

Here is a reference that shows you precisely how the two-sample chi-square test statistic is calculated, along with the degrees of freedom for the asymptotic chi-square distribution.
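In R this approach can be sketched with `chisq.test` on a binned k x 2 table (the cut points and Poisson toy data below are illustrative; in practice choose bins so every expected cell count exceeds about 5):

```r
set.seed(1)
x <- rpois(100, 10)
y <- rpois(100, 11)

# Bin both samples with the same cut points spanning the pooled range
breaks <- c(-Inf, 7, 9, 11, 13, Inf)   # 5 bins (illustrative choice)
tab <- cbind(table(cut(x, breaks)),
             table(cut(y, breaks)))    # k x 2 contingency table

# Chi-square test of homogeneity; df = (k - 1) * (2 - 1)
res <- chisq.test(tab)
res$statistic
res$parameter   # 4 degrees of freedom here
res$p.value
```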

Michael R. Chernick
  • Do you suggest that each integer is a bin, or to group integers into bins? If it's the latter, is there any rule of thumb for selecting an appropriate number of bins? – fmark Sep 07 '12 at 00:43
  • @fmark Of course a bin would include a group of integers. There is no rule of thumb that I can think of that would work in general. The idea is to have enough bins so that the histogram is not too smooth and not overly variable. It is very much like choosing a bandwidth for a kernel smoother. – Michael R. Chernick Sep 07 '12 at 00:48
  • Thanks Michael. There is a lower bound on the cell size for a chi-square test (i.e. > 5). Apart from that, would you place some indicative bounds on the number of bins (e.g. 2 < n < 13 or something like that)? I'm not sure how smooth the histogram *should* look. – fmark Sep 07 '12 at 01:00
  • Yes, the expected cell size should be >5 for each cell. That is because the chi-square distribution is asymptotic and is not a good approximation when the cell sizes are small. – Michael R. Chernick Sep 07 '12 at 02:00
  • 2
    How does one apply the chi-square test for a *two-sample* application? You need a reference distribution but you don't have one in that case. How do you find it? What would be the degrees of freedom to use? – whuber Sep 07 '12 at 14:56
  • @whuber Sorry, chi-square is a one-sample test with an assumed reference distribution. In the two-sample case the question is not about fit to a reference distribution but rather whether the two distributions can be determined to be different based on their samples. I don't recall hearing of such a thing, but what if we arbitrarily make one distribution the reference by using its cell frequencies in place of the expected frequencies of the reference distribution? – Michael R. Chernick Sep 07 '12 at 15:17
  • I imagine that the asymptotic distribution of the test statistic would look very much like a chi square distribution (if it isn't exactly one). There must be some literature on this. I will do a search and get back on this. – Michael R. Chernick Sep 07 '12 at 15:17
  • 2
    I did some experimentation and found the distribution doesn't look chi-square even for some largish datasets (*e.g.*, comparing $1000$ values to $100$ values and each bin with more than $5$ values (typically).) – whuber Sep 07 '12 at 15:33
  • @whuber Please note my edited answer. The test statistic is done a little differently than in my comment but it does have an asymptotic chi square distribution according to the reference. – Michael R. Chernick Sep 07 '12 at 15:38
  • @whuber If you look at the two binned data sets, they are categories with counts. If you use the same k bins for both data sets, we can construct a k x 2 contingency table out of it. Then the usual chi-square test for independence between the columns of the table serves as a two-sample chi-square test, and it has an asymptotic chi-square distribution under the null hypothesis. For the m-sample problem with m > 2 we just apply this to the analogous k x m table. Thinking about this, I feel that this is something we both should have known. Somehow it didn't come to me initially. – Michael R. Chernick Sep 07 '12 at 15:47
  • 2
    Are you suggesting I didn't know this? :-) But because in practice we tend to get small counts out in the tails, the chi square approximation doesn't work well. A permutation test (or, simply, bootstrapping the chi square statistic) does work: that's what I did the experimentation to confirm. – whuber Sep 07 '12 at 15:52
  • @whuber I am not intending to suggest anything about your knowledge of statistics. But the way you address your comments about my answer seemed to ignore this even when you discussed having done simulations. The idea for the test I think should be to bin in a way that does not create sparse cells and yet does not create an overly smooth histogram. You don't think I don't know that the chi-square test is a poor approximation when there are sparse cells do you? I would agree that bootstrapping the null distribution of the chi square statistic would be good in a sparse cell situation. – Michael R. Chernick Sep 07 '12 at 16:01
  • @whuber I think that we discussed this once before on a different post. Maybe you know the link? – Michael R. Chernick Sep 07 '12 at 16:02
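The Monte Carlo route discussed in the comments above, for when cells are sparse, can be sketched with `chisq.test(..., simulate.p.value = TRUE)` in R; the sample sizes and Poisson rates below loosely mirror the figures mentioned in the comments but are otherwise arbitrary:

```r
set.seed(1)
x <- rpois(1000, 1)
y <- rpois(100, 1.4)

# One bin per observed integer value; the tail cells will be sparse
lv  <- 0:max(c(x, y))
tab <- cbind(table(factor(x, levels = lv)),
             table(factor(y, levels = lv)))
tab <- tab[rowSums(tab) > 0, , drop = FALSE]  # drop empty rows

# Simulated p-value avoids relying on the asymptotic chi-square law
res <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
res$p.value
```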