I've come across the Bonferroni correction in a genetics course. I understand the idea that we need to account for type I errors, but I can't fully get my head around it. In the genetics scenario, it is used to test for associations between pathways and a response. So, if we have 3 pathways, we get 3 p-values (one for each pathway), and then we use a Bonferroni correction to adjust each p-value. However, what if we suddenly add another 2 pathways? Then we need to lower the p-value required for 'significance' for each pathway, through the Bonferroni correction. It seems as if we are 'punishing' the pathways by the existence of other pathways, even though each hypothesis test is independent? The correction seems like it makes it harder to find significance, compared to just testing each pathway completely by itself.
- You have spotted why the Bonferroni correction doesn't really make sense. See https://stats.stackexchange.com/questions/120362/whats-wrong-with-bonferroni-adjustments/ for more info. – fblundun Apr 04 '21 at 19:18
- The Bonferroni correction is an answer to the critique that a rejection of a null hypothesis isn't very surprising because you tested so many null hypotheses. Run 100 independent tests under the null, and you'd expect 5 to have an estimated p-value < .05. Multiple comparison tests like the Bonferroni correction up the hurdle based upon how many other tests you're running. That said, it's an approach that's not without controversy. – Matthew Gunn Apr 04 '21 at 19:28
- Yes, you must pre-specify the data analysis plan when you use Bonferroni, or FDR, or Bayesian methods, etc. Otherwise, you run the same risk of cherry-picking that the methods are trying to avoid. – BigBendRegion Apr 04 '21 at 20:11
- See [Wikipedia](https://en.wikipedia.org/wiki/Bonferroni_correction) on 'Bonferroni correction' and its link to Bonferroni's inequality. Because the correction method relies on an inequality, it can be too 'conservative', that is, too reluctant to declare significant differences. It is usually OK for avoiding 'false discovery' when running a few _ad hoc_ tests within a 5% overall error rate. – BruceET Apr 04 '21 at 20:31
1 Answer
> The correction seems like it makes it harder to find significance, compared to just testing each pathway completely by itself.
That's true, and in principle for a good reason: the more tests you run, the greater the probability that at least one will be a false positive. So if you want to avoid making false-positive errors, you need to take that fact into account. The more you guard against false-positive errors, however, the greater the risk of missing true positive results. The central question becomes what types of errors you are trying to avoid, and the attendant implications for how much "harder" you want to make it "to find significance."
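To make that concrete, here is a small Python sketch (purely illustrative, simulating uniform p-values under the null) of how the family-wise error rate grows with the number of independent tests when each is run at a per-test level of 0.05:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05

for m in [1, 3, 5, 20, 100]:
    # Analytic probability of at least one false positive among m independent tests
    fwer = 1 - (1 - alpha) ** m
    # Monte Carlo check: m uniform p-values per replicate (all nulls true),
    # count how often at least one falls below alpha
    sims = rng.uniform(size=(100_000, m))
    mc = np.mean((sims < alpha).any(axis=1))
    print(f"m = {m:3d}: FWER = {fwer:.3f} (simulated {mc:.3f})")
```

With 100 tests, the chance of at least one false positive is already about 99%, which is exactly what Bonferroni-type corrections react to.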
The Bonferroni correction tries hard to avoid any false-positive errors. It seeks to control the family-wise error rate (FWER), the probability that any of the test results is a false positive. If that's your goal then you do have to be more stringent as you evaluate more hypotheses together. It actually tries too hard, as you can get the same control of FWER with less of a chance of missing true positives by using the Holm modification.
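As a minimal sketch of the two procedures (the five p-values below are made up, as if for five pathways):

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the FWER at level alpha."""
    pvals = np.asarray(pvals)
    return pvals <= alpha / len(pvals)

def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value to alpha / (m - k + 1)
    and stop at the first failure. Also controls the FWER at level alpha,
    but is uniformly more powerful than Bonferroni."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for k, idx in enumerate(order):          # k = 0, 1, ..., so threshold is alpha / (m - k)
        if pvals[idx] <= alpha / (m - k):
            reject[idx] = True
        else:
            break
    return reject

# Hypothetical p-values for five pathways
p = [0.001, 0.012, 0.03, 0.04, 0.20]
print(bonferroni_reject(p))  # [ True False False False False]
print(holm_reject(p))        # [ True  True False False False]
```

On these p-values Bonferroni (threshold 0.05/5 = 0.01) rejects only the first hypothesis, while Holm also rejects the second, which is why Holm is preferred when the goal is FWER control. In practice a library routine such as statsmodels' `multipletests` (methods `'bonferroni'` and `'holm'`) does the same bookkeeping.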
If you are willing to accept some false positives, however, you can instead control the false discovery rate (FDR), the fraction of your positive test results that is erroneous. Particularly in large-scale studies like gene expression analysis (where there will also probably be further independent confirmation in follow-up experiments), controlling FDR is preferred as you are less likely to miss important true positive results. That approach largely overcomes your objection with respect to what happens as you add more hypotheses: the more tests you run, the more positive findings are typically allowed.
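A sketch of the Benjamini-Hochberg step-up procedure, applied to the same made-up p-values, shows how controlling FDR lets more findings through:

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: sort the m p-values, find the largest rank k
    with p_(k) <= (k / m) * q, and reject all hypotheses up to that rank.
    Controls the expected false discovery rate at q for independent tests."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    thresholds = (np.arange(1, m + 1) / m) * q
    below = np.nonzero(ranked <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[: below[-1] + 1]] = True
    return reject

# Same hypothetical p-values as above
p = [0.001, 0.012, 0.03, 0.04, 0.20]
print(bh_reject(p))  # [ True  True  True  True False]
```

At q = 0.05 it rejects four of the five hypotheses, versus one for Bonferroni and two for Holm on the same inputs.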
In particular, control of FDR is both adaptive and scalable. Having 1 false-positive error (when controlling the FWER) might be a big deal if you are only performing 2 tests, but might not matter if you are performing 100 tests (adaptive). At, say, 5% FDR, you expect about 5% of your positive findings to be false: whether you are risking 5 false positives out of 100 positive findings or 500 out of 10,000 probably doesn't matter much (scalable).
The assumptions underlying these tests, however, can be somewhat questionable. In practice, the assumption of many such tests that the hypotheses are independent might be hard to justify. The discussion on the page linked from a comment is a good place to start. When taken to an extreme, FWER correction for multiple testing might seem to require correcting your own hypothesis tests for other people's independent tests on the same data set, which would be both impractical and a sure way to miss a lot of true positive findings.
