I'm working with multi-modal data and need to check whether individual samples are statistically distinct from one another, so I'm running the KS test on pairs of samples.
But I've noticed that p-values below 0.05 show up less often than expected for samples that should be similar.
So I ran a simulation with a simple bimodal distribution:
n <- 10000      # observations per sample
nsamp <- 10000  # number of simulated sample pairs

# draw pairs of samples from the same bimodal mixture and record the KS p-value
ps <- replicate(nsamp, {
  y1 <- c(rnorm(n/2), rnorm(n/2, 5, 2))
  y2 <- c(rnorm(n/2), rnorm(n/2, 5, 2))
  tt <- ks.test(y1, y2)
  tt$p.value
})

# compare the distribution of the p-values to the uniform and to Beta(2, 1)
plot(ecdf(ps))
ks.test(ps, 'punif')
plot(ecdf(runif(100000)), add = TRUE, col = "red")
plot(ecdf(rbeta(100000, 2, 1)), add = TRUE, col = "blue")
To my surprise, the p-values are not uniformly distributed; rather, they follow a distribution similar to a Beta distribution with parameters alpha=2 and beta=1.
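For instance, the fit can be checked directly against the Beta CDF (a minimal sketch; the Beta(2, 1) parameters are just my eyeballed guess from the plot):

ks.test(ps, 'pbeta', 2, 1)                       # one-sample KS test against Beta(2, 1)
curve(pbeta(x, 2, 1), add = TRUE, col = "blue")  # overlay the Beta(2, 1) CDF on the ecdf plot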
Question 1: Do I interpret this correctly, namely that the KS test is more sensitive to departures from the expected values in multi-modal distributions than in unimodal ones? I.e., are normally distributed samples the worst-case scenario for the KS test?
Question 2: Should I rather perform a test that the p-values are stochastically greater than uniformly distributed, not a test that they are uniformly distributed (i.e. something like ks.test(ps, 'punif', alternative='greater'))?
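As a sketch of what I have in mind (note that for the one-sample ks.test the alternative describes the CDF of the first argument relative to the reference distribution, so I'm not certain which direction is the appropriate one here):

ks.test(ps, 'punif', alternative = 'greater')  # alternative: CDF of ps lies above punif
ks.test(ps, 'punif', alternative = 'less')     # alternative: CDF of ps lies below punif,
                                               # i.e. ps stochastically greater than uniform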
Edit 1: removed sample() from the functions.
Edit 2:
While in the example above I'm using a simple concatenation to combine the observations from two different distributions, I do have reason to believe this is the correct approach to model the real-world observations.
The data in question come from a few different experiments, and the values in question are reaction times. Because the reaction time is on the order of 100 µs while I'm interested in differences down to a few ns, I need to collect a lot of observations. To reduce bias from running the experiments in the exact same order (say ABC ABC ABC ABC, etc., with A, B and C being individual test classes) I'm randomising the order in which I run them, but I still run them in groups (e.g. ABC CBA BAC CAB, etc.), as illustrated below.
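Just to illustrate the scheduling (not the actual harness; the class names are placeholders):

classes <- c("A", "B", "C")
ngroups <- 4
# each group is an independent random permutation of the test classes
run_order <- unlist(replicate(ngroups, sample(classes), simplify = FALSE))
run_order  # e.g. "A" "C" "B"  "C" "B" "A"  "B" "A" "C"  "C" "A" "B"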
Now, because I run hundreds of thousands of tests, it takes time.
If there is a noise source that is active for a continuous period of time, but only for part of the time it takes to run the tests, then the collected data will look like a concatenation of two distributions, not a random selection from two distributions. So I think I'm correct to model it through c(rnorm(), rnorm()) rather than something like ifelse(rbinom(), rnorm(), rnorm()).
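To make the distinction concrete (a minimal sketch; the 0.5 mixing probability and the N(5, 2) noise component are just the values from the toy simulation above):

n <- 10000

# concatenation: the noise is active for one contiguous block of the run
y_concat <- c(rnorm(n/2), rnorm(n/2, 5, 2))

# mixture: each observation is independently affected with probability 0.5
noisy <- rbinom(n, 1, 0.5) == 1
y_mix <- ifelse(noisy, rnorm(n, 5, 2), rnorm(n))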