
I have two sets of 40,000 independent datasets (each dataset containing 10,000 elements), which I am comparing using an Anderson-Darling test. More precisely, I compare each dataset of the first set with the corresponding dataset of the second, so that at the end I have 40,000 p-values (or AD statistics, to be compared with critical values). To give a little more detail, I am comparing two models that I ran over 40,000 input test cases. These models produce statistical outputs, and I am trying to estimate whether a change in the model significantly impacts the output distributions.

My question is the following: searching for a significant result, should I correct my p-values to account for the fact that I run 40,000 tests (for instance, using a Bonferroni correction)? As far as I understand, if I select a significance level $\alpha=0.05$, I consider a result significant when its p-value is below 5%, because it has no more than a 5% chance of arising by chance.

However, when I run 40,000 tests, am I not more likely to get p-values below 0.05 just by chance?

It seems to me that I could consider my models to be different if more than 5% of the computed p-values fall below 0.05, and not just one. Would this be correct?
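To give an idea of the scale: with 40,000 true-null tests at $\alpha=0.05$, about $40{,}000 \times 0.05 = 2{,}000$ p-values below 0.05 would be expected by chance alone. Here is a minimal sketch of that reasoning, with a hypothetical `pvals` array standing in for my 40,000 AD p-values (simulated under the null just for illustration):

```python
import numpy as np

alpha = 0.05
n_tests = 40_000

# pvals: hypothetical stand-in for the 40,000 Anderson-Darling p-values,
# simulated under the null just to illustrate the expected behaviour
rng = np.random.default_rng(0)
pvals = rng.uniform(0, 1, n_tests)

frac_below = np.mean(pvals < alpha)          # roughly 0.05 expected under the null
bonferroni_threshold = alpha / n_tests       # 1.25e-06
n_bonferroni_hits = np.sum(pvals < bonferroni_threshold)

print(f"fraction of p-values < {alpha}: {frac_below:.3f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.2e}, hits: {n_bonferroni_hits}")
```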

Clej
  • Please say more about what you are trying to accomplish here. I worry about any comparison that involves separate tests on 40,000 different data sets. I suspect that there is a way to combine information into a more direct test between your 2 modeling approaches. – EdM Jul 26 '20 at 20:16
  • Thanks! Well, starting from a set of a dozen input parameters, I compute the output distribution of a variable of interest using a Monte-Carlo model. I have 40,000 different sets of input parameters. Recently, I slightly changed my Monte-Carlo model and reran this new version with my inputs. As such, I got two sets of 40,000 output distributions, each corresponding to a particular set of input parameters. In order to investigate a potential impact of my model changes, I ran AD tests to compare the outputs of the two models... and I'm now trying to obtain a global view of the situation. – Clej Jul 26 '20 at 23:35

1 Answer


The Bonferroni correction (or its Holm modification, which is more powerful while providing the same control) controls the family-wise error rate: the chance that any of your nominally significant results is a false positive. If that's what you want to control, then you need to consider all 40,000 comparisons. You could instead try to limit the false-discovery rate, the fraction of nominally significant results that arise just by chance, again considering all 40,000 comparisons. See the Wikipedia multiple comparisons page.
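If you go that route, a minimal sketch with `statsmodels` might look like this (the `pvals` array is a placeholder for your 40,000 AD p-values):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# pvals: placeholder for the 40,000 Anderson-Darling p-values
rng = np.random.default_rng(1)
pvals = rng.uniform(0, 1, 40_000)

# Family-wise error rate control (Holm)
reject_holm, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")

# False-discovery rate control (Benjamini-Hochberg)
reject_fdr, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Holm rejections:", reject_holm.sum())
print("BH (FDR) rejections:", reject_fdr.sum())
```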

But it seems that you can avoid the multiple-comparisons problem, and answer your practical question much more efficiently, with a different approach. The approach you describe throws away all information about how the differences between the models depend on the specific choices of input parameters.

It sounds like it should be possible to estimate the differences directly as functions of the 12 input-parameter values. Putting aside the question of whether the Anderson-Darling test is appropriate in this case, the idea would be to ignore the p-values and use the test statistic itself as a dependent variable, and model it either with a simple linear model in the 12 input values, with interaction terms added, or with other modeling approaches.
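A minimal sketch of that idea, assuming hypothetical arrays `inputs` (your 40,000 × 12 matrix of input parameters) and `ad_stat` (the corresponding AD statistics):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: 40,000 test cases x 12 input parameters,
# and one AD statistic per test case (placeholders only)
rng = np.random.default_rng(2)
inputs = pd.DataFrame(rng.normal(size=(40_000, 12)),
                      columns=[f"x{i}" for i in range(1, 13)])
ad_stat = rng.gamma(shape=2.0, scale=1.0, size=40_000)

# Simple linear model of the AD statistic as a function of the inputs
X = sm.add_constant(inputs)
fit = sm.OLS(ad_stat, X).fit()
print(fit.summary())
```

Interaction terms or a more flexible learner could replace the plain OLS fit if the dependence on the inputs turns out to be clearly non-linear.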

That should help document whether the differences are large enough to matter (with pairs of 10,000 data points per comparison you are very likely, as with normality testing, to find "statistically significant" discrepancies of little practical importance), while showing how the magnitudes of the discrepancies might depend on certain choices or combinations of the input-parameter values.

EdM
  • That's probably the best way to go. However, would it still make sense to estimate the proportion of cases for which the statistic exceeds the critical value? I'd say it would give an idea of the magnitude of the differences. Or is that useless because the chance of obtaining "statistically significant" but practically insignificant results is too high? By the way, is there a clear way to identify when a test fails for this reason? – Clej Jul 27 '20 at 08:38
  • @Clej the problem with distributional tests on very large data sets is that they are too sensitive. For example, real-world data are seldom exactly normal, so normality tests with large data correctly show significant deviations from the null hypothesis. The tests aren't failing statistically, they just aren't providing very useful information. Were the null hypothesis of no differences at all between the 2 types of models to hold, the p-values would be uniformly distributed over [0, 1] (a quick check of this is sketched after these comments), but otherwise I don't think you can easily interpret the distribution of p-values. – EdM Jul 27 '20 at 09:34
  • That makes sense. Still, that seems like an interesting point: might the distribution of the p-values be more useful than the individual values themselves? For instance, p-values uniformly distributed would be a sign that the null hypothesis holds, and p-values piled up on the left would indicate that the null hypothesis can be rejected in at least some cases? – Clej Jul 29 '20 at 10:06
  • @Clej the "p-values stuck on the left" are pretty much what the controls for family-wise error rate (FWER) or false-discovery rates (FDR) are looking for: are there more in the "significant" region than you would expect if the null hypothesis were true? The differences among the multiple-comparison corrections are mostly how far out to "the left" you choose to go. Don't know that the actual distribution of p-values means much, except for how it differs from a uniform distribution. – EdM Jul 29 '20 at 15:33
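A minimal sketch of the uniformity check mentioned in the comments above (again with a placeholder `pvals` array standing in for the real 40,000 AD p-values):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# pvals: placeholder for the 40,000 Anderson-Darling p-values
rng = np.random.default_rng(3)
pvals = rng.uniform(0, 1, 40_000)

# Kolmogorov-Smirnov test of the p-values against Uniform(0, 1)
ks_stat, ks_p = stats.kstest(pvals, "uniform")
print(f"KS statistic vs. uniform: {ks_stat:.4f} (p = {ks_p:.3f})")

# Histogram: under the null this should be roughly flat;
# a spike near 0 suggests departures from the null in some test cases
plt.hist(pvals, bins=50, density=True)
plt.xlabel("p-value")
plt.ylabel("density")
plt.show()
```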