
I just ran a series of 40 t-tests and then applied a Bonferroni correction for multiple testing, and my p-values got smaller. Why does this happen? It was my impression that multiple-testing correction would always increase the p-values.

Before correction:

           substrate              p_value
19           glycan 0.000139091904182433
23 dermatan sulfate 0.000139091904182433
4            chitin 0.000294435140367691
22           xylose  0.00387472305660014
5       beta-glucan  0.00552400891530821
2         cellulose   0.0130881279666714

After correction:

          substrate      p_value
10           glucose 5.110415e-21
29          fructose 1.745709e-20
26            lignin 7.090204e-18
30 cyclomaltodextrin 3.569263e-10
31  lacto-N-tetraose 3.569263e-10
32       hyaluronate 3.569263e-10

Code used to generate the p-values:

# loop labeling one substrate as "A" and every other substrate as "B", then doing a t-test between the counts
sub_pvals = NULL
for(sub in unique(cazy_cata_melt$Substrate)){
  df= cazy_cata_melt
  df[df$Substrate != sub,]$Substrate = "B"
  df[df$Substrate == sub,]$Substrate = "A"
  input = cbind(substrate = sub, p_value = t.test(value ~ Substrate, data = df)[[3]][1])
  sub_pvals = rbind.data.frame(sub_pvals, input)
}

#correction for multiple testing
sub_pvals$p_value = p.adjust(sub_pvals$p_value, method = "bonferroni", n = length(unique(cazy_cata_melt$Substrate)))

#ordering the dataframe
sub_pvals = sub_pvals[order(sub_pvals$p_value),]

Data available here: https://pastebin.com/vsbYGkQW
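
For reference, this is what I thought `p.adjust()` was supposed to do; on a toy vector of made-up p-values (not my data), the Bonferroni-adjusted values only ever get larger, capped at 1:

p <- c(0.00014, 0.0039, 0.013, 0.2)
p.adjust(p, method = "bonferroni", n = 40)
# [1] 0.0056 0.1560 0.5200 1.0000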

  • That should never happen. If you post the code you used to run the correction, it'll make it much easier to figure out what went wrong. And something did go wrong - not only are the p-values getting more significant, they're doing so by a factor of billions or more, which shouldn't happen in either direction with only 40 tests to correct for. This result would be strange even with the before/after values switched. – Nuclear Hoagie Jul 24 '20 at 14:50
  • Hi @NuclearWang I have added the code and will add the dataset also. EDIT: data added – Lamma Jul 24 '20 at 14:55
  • The before and after lists are of different substrates so...it's hard to say from your post. – Matt Krause Jul 24 '20 at 14:59
  • @MattKrause That is just because I have ordered them to show the smallest p-values – Lamma Jul 24 '20 at 15:01
  • The correction shouldn't affect the ordering by p-values, so something is clearly wrong with the coding. – EdM Jul 24 '20 at 15:02
  • @EdM Indeed as it would seem! I have shared the code above as I myself am not sure what I did wrong. – Lamma Jul 24 '20 at 15:03
  • After looking at the code I'm more worried about potentially serious statistical problems: you shouldn't be doing t-tests the way that you are, serially comparing each of the substrates against the means of all the others, and a quick look at your raw data makes me wonder whether t-tests would even be appropriate as the values all seem to be small integers. Please say more about what you are trying to accomplish, as I think that the Bonferroni correction coding problem is much less serious than are the apparent problems with your overall approach. – EdM Jul 24 '20 at 15:12
  • @EdM I want to see if the counts for any substrate stand out as significant from the rest. I originally did this with an ANOVA, but that does a pairwise style so it's not too much help in an overarching sense; let me know if that does not make sense :) – Lamma Jul 24 '20 at 15:15

1 Answer


The underlying data are counts associated with each of a set of 40 substrates, so, putting aside the coding problem (which isn't really on-topic here), the approach has two problems: t-tests aren't correct for count data, and serially comparing each substrate against the mean of all the other substrates isn't a correct way to do these tests.

Count data are best analyzed as Poisson or negative-binomial data. This is possible, for example, with the `glm()` function in R. In your case that would be set up similarly to an ANOVA, coding the substrates as levels of a single categorical predictor. The analysis would be performed with an underlying error distribution (needed to assess the significance of any differences) appropriate to count data, for which the normal distribution assumed by ANOVA and t-tests doesn't hold.
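
A minimal sketch of that setup, assuming (from the code in your question) a data frame `cazy_cata_melt` with a `Substrate` factor and a `value` column of counts:

# Poisson GLM with substrate as a single categorical predictor;
# swap in family = quasipoisson, or MASS::glm.nb(), if the counts are overdispersed
fit <- glm(value ~ Substrate, data = cazy_cata_melt, family = poisson)
summary(fit)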

You start with the significance of the overall model. If the model as a whole isn't significant you stop and don't proceed to individual comparisons. If the model is significant overall, there are much better (and more powerful) ways to examine differences among the individual substrates; see this answer for an example with count data.
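
A sketch of that workflow, continuing from the hypothetical `fit` above and using the `emmeans` route suggested in the comments below:

# overall test: is there any difference among substrates at all?
anova(fit, test = "Chisq")

# only if that test is significant, compare each substrate with the grand mean
library(emmeans)
emm <- emmeans(fit, ~ Substrate)
contrast(emm, method = "eff", adjust = "bonferroni")  # "eff": each mean vs. the grand mean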

  • How do I determine if a model is significant, and then how do I decide if reporting estimated marginal means is the right thing to do? – Lamma Aug 02 '20 at 08:23
  • @Lamma if you run a model appropriate for count data (Poisson, quasi-Poisson, negative binomial) of the form `counts ~ substrate` with `substrate` a multi-category factor variable, you will get a p-value for the _entire model_ that tests whether there are any significant differences among substrates with respect to counts. If that test shows significance, then examining estimated marginal means (e.g., via the `emmeans` package) provides principled ways to examine differences among the substrates. That pools information from all substrates in a way that's superior to repeated simple t-tests. – EdM Aug 02 '20 at 14:25
  • Thank you, this makes sense :) I am still a bit stuck on how to interpret the `emmeans` output though. I have looked over the package's "basics of emmeans" vignette but I still don't really get them. – Lamma Aug 02 '20 at 15:35
  • @Lamma see the end of my answer [here](https://stats.stackexchange.com/a/447334/28500), based on this comment on the original version of that answer by `emmeans` author Russ Lenth: "Look at the documentation in **emmeans** for `contrast()` and `eff.emmc`. In short, specifying "eff" contrasts gives comparisons of each mean with the grand mean." I think that's what you want. – EdM Aug 03 '20 at 15:31