
Suppose we have 3 classes (A, B and C). Their members took a math test and a philosophy test.

What I want to know is which class performs better than the others on math, and which class performs better on philosophy. They took two different tests, but I want to discuss these subjects completely separately.

If there is no correlation between the math test scores and the philosophy test scores, I think we can simply apply the Bonferroni correction. Since we have 3 classes, the significance level is 0.05/3. Then we use a t-test (or something similar) and say, for instance, that A and B got significantly different scores on the math test but the difference on the philosophy test is not significant. (Here we use the t-test 6 times in total, since we have 2 subjects times 3 class pairs.) Is this idea correct?

What if there is a correlation, meaning students who are good at math tend to be bad at philosophy and vice versa? Do I have to lower the significance level? If I have to, what significance level should be used?

  • Why 0.05/3, if you do 6 tests? – Björn Sep 20 '16 at 05:17
  • What exactly do you want to test? (1) That the three classes perform differently on math? Or (2) that the three classes perform differently on philosophy? Or (3) that they perform differently on at least one of math/ phil? –  Sep 20 '16 at 05:19
  • @Björn Because the scores of math test and the scores of philosophy test are independent. I think we have to correct alpha when we re-use the same data. –  Sep 20 '16 at 12:14
  • @fcop I have two separate questions: do they perform differently on math and do they perform differently on philosophy? (If possible, I want to know which class does better on math and which class does better on philosophy.) I do not compare the result of math with that of philosophy –  Sep 20 '16 at 12:24
  • @Nickel Why do you think we only have to adjust for alpha when you re-use the same data (in a way it's data on the same students, right?)? I'd agree that in many settings it's a matter of opinion/tradition/convention (unless there are regulatory or legal requirements) what one adjusts for and when, but that particular convention is not one I had seen before. – Björn Sep 20 '16 at 17:52
  • @Björn Now I came to know the word, familywise error. (I made the same comment to fcop's answer) I do not think the familywise error is applied to my case. What I want to know is, for example, class A does better on math than B does. I will not conclude like class A is better at studying than B is because A got higher scores on at least one subject. I also learned tests of risk factors for diseases. They test each risk factor separately, so do not divide alpha by the number of possible risk factors (the number of tests they do). Isn't my case similar to the risk factor case? –  Sep 21 '16 at 00:06

3 Answers


In order to show that the Bonferroni correction controls the familywise error rate you do not need to assume independence, so the type I error will be controlled familywise if you apply the Bonferroni correction. The ''proof'' is based on Boole's inequality, which holds under dependence as well as independence.

So if your significance level is $\alpha$ and you perform $n$ tests, independent or not, and each individual test is done at a significance level of $\alpha/n$, then the familywise error for all the tests will be controlled at the level $\alpha$, meaning that the probability of a type I error at the family level will be lower than or equal to $\alpha$.
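A quick Monte-Carlo sketch of this claim (not part of the original answer; the correlation value and all names are illustrative): six strongly correlated z-statistics are simulated under the global null, each tested at $\alpha/6$, and the familywise rejection rate stays at or below $\alpha$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha, n_tests, n_sims = 0.05, 6, 20000

# Equicorrelated standard-normal test statistics under the global null
# (rho = 0.7 is an arbitrary illustration of strong dependence).
rho = 0.7
cov = np.full((n_tests, n_tests), rho) + (1 - rho) * np.eye(n_tests)
z = rng.multivariate_normal(np.zeros(n_tests), cov, size=n_sims)

# Two-sided p-values; reject when p < alpha / n_tests (Bonferroni).
p = 2 * norm.sf(np.abs(z))
fwer = (p < alpha / n_tests).any(axis=1).mean()
print(f"estimated FWER: {fwer:.4f} (guaranteed bound: {alpha})")
```

Under this positive dependence the estimated FWER lands well below the bound, which is exactly the conservatism discussed next.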

So it may be that the correction is ''conservative'', i.e. the familywise type I error could be strictly lower than $\alpha$, or, so to say, ''you make too few'' type I errors.

At first glance this does not seem like a problem: why would one have a problem with making too few (type I) errors?

Now there is a trade-off between the power of a test and the probability of a type I error; the lower the probability of a type I error, the higher the probability of a type II error and thus the lower the power of a test.

So if the Bonferroni correction is conservative, it will still control the familywise error at the level $\alpha$, but the realized level will be strictly lower than $\alpha$. As a too-low type I error probability implies a loss of power, it follows that, in cases where the Bonferroni correction is conservative, you lose power!

It can be shown that the Bonferroni correction is conservative when the tests are dependent.

To conclude: you do not need independence for applying Bonferroni, it will still control the familywise error, but in the case of dependence between tests it will be conservative and in that case, even if the familywise type I error is controled, this results in a loss of power.
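To put a rough number on that power loss, here is a sketch (not from the original answer; the standardized effect size $\delta$ is an arbitrary assumption): for a one-sided z-test at per-test level $t$ against a shift $\delta$, the power is $P(Z > z_{1-t} - \delta)$, so a smaller per-test alpha directly means lower power.

```python
from scipy.stats import norm

delta = 2.5  # assumed standardized effect size, for illustration only

# Power of a one-sided z-test at per-test level a against a shift delta.
powers = {a: norm.sf(norm.isf(a) - delta) for a in (0.05, 0.05 / 3, 0.05 / 6)}
for a, power in powers.items():
    print(f"per-test alpha = {a:.4f} -> power = {power:.3f}")
```

The power drops with each further division of alpha, which is why an unnecessarily strict correction is not harmless.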

Note: The Holm procedure controls the familywise type I error in the same way as Bonferroni; it will also be conservative for dependent tests, but less (or at most equally) conservative than Bonferroni.

Note: The Šidák correction assumes independence.
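For concreteness, a sketch (not from the original answer; the p-values are made up) comparing the per-test thresholds and the Holm step-down rule:

```python
alpha, n = 0.05, 6

bonferroni_threshold = alpha / n              # ~0.00833, valid under any dependence
sidak_threshold = 1 - (1 - alpha) ** (1 / n)  # ~0.00851, assumes independence

def holm_reject(pvals, alpha=0.05):
    """Holm step-down: compare the (rank+1)-th smallest p-value to
    alpha / (n - rank) and stop at the first non-rejection."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    reject = [False] * n
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (n - rank):
            reject[i] = True
        else:
            break
    return reject

pvals = [0.001, 0.009, 0.012, 0.2, 0.4, 0.9]
print(holm_reject(pvals))  # rejects three; plain Bonferroni rejects only 0.001
```

With these made-up p-values Holm rejects three hypotheses while Bonferroni (threshold 0.00833) rejects only one, illustrating that Holm is never more conservative.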

EDIT 21-09-2016

A. FWER is applicable to your case

FWER control is needed whenever you use one and the same sample to test a family of hypotheses. This is the case for you: e.g. if you want to show that the classes perform differently on math, you will have to test three hypotheses, i.e.

  1. $H_0^{(1)}: \mu_{Am}=\mu_{Bm}$ versus $H_1^{(1)}: \mu_{Am} \ne \mu_{Bm}$
  2. $H_0^{(2)}: \mu_{Am}=\mu_{Cm}$ versus $H_1^{(2)}: \mu_{Am} \ne \mu_{Cm}$
  3. $H_0^{(3)}: \mu_{Bm}=\mu_{Cm}$ versus $H_1^{(3)}: \mu_{Bm} \ne \mu_{Cm}$

where $\mu_{ct}$ is the mean score of class $c$ on test $t$, so e.g. $\mu_{Am}$ is the mean of class $A$ on math.

There is no doubt that this is a family of three hypotheses. If you perform each of these three tests at a significance level of $\alpha$ then your familywise type I error will be larger than $\alpha$, so in order to control the type I error ''familywise'' you will have to do ''something'' to reduce it to the level $\alpha$.
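A back-of-the-envelope sketch of that inflation (assuming independence just for the arithmetic; not part of the original answer):

```python
alpha = 0.05

# Probability of at least one false rejection among three independent
# tests under the global null, with and without the Bonferroni division.
fwer_uncorrected = 1 - (1 - alpha) ** 3     # ~0.143, well above 0.05
fwer_bonferroni = 1 - (1 - alpha / 3) ** 3  # ~0.049, just below 0.05
print(fwer_uncorrected, fwer_bonferroni)
```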

Bonferroni is one possibility, and Bonferroni is precisely about controlling the familywise error rate, so there is no doubt that it is applicable to your case. A nice introduction can be found at this link

B. Detailed analysis of your test

In your comment below you are more specific about what you want to do, I cite your comment:

You say: ''What I want to know is, for example, class A does better on math than B does. I will not conclude like class A is better at studying than B is because A got higher scores on at least one subject (either of the subjects or maybe both). Is this right?''

This reminds me of a discussion I had with @amoeba, @Wayne and @Anoldmaninthesea in this post: What's wrong with ''multiple testing correction'' compared to ''joint tests''?.

My point is that you must precisely define what you want to test.

If you want to check whether the three classes perform differently on math, then you should test $H_0: \mu_{Am}=\mu_{Bm}=\mu_{Cm}$ versus $H_1: \mu_{Am} \ne \mu_{Bm} \text{ or } \mu_{Am} \ne \mu_{Cm} \text{ or } \mu_{Bm} \ne \mu_{Cm}$. If you know the joint distribution of the class scores on math, then you can do a joint test (see What's wrong with ''multiple testing correction'' compared to ''joint tests''?).

If you do not know that joint distribution, then you can do a family of tests, provided you control the familywise type I error rate!

The family of tests that you can perform is the three tests I mentioned supra. If you want to control FWER at the level $\alpha$ then you should do a correction like e.g. Bonferroni.

However, according to your comment that I cited supra, you don't want to test whether the classes perform differently on math, but whether the classes differ in studying overall, meaning that they perform differently on math OR on philosophy. This implies a different test: $H_0: \mu_{Am}=\mu_{Bm}=\mu_{Cm} \text{ AND }\mu_{Ap}=\mu_{Bp}=\mu_{Cp}$ versus .... (the opposite).

This can be replaced by a family of six hypotheses. If you have a family of six hypotheses then you should divide $\alpha$ by six, as @Björn said in his comment below your question. However, if there is dependence amongst the tests, then dividing by six will lead to conservative FWER control and to a loss of power, as I explained supra.

Why is that? Well, I think that is the reason for your question. Suppose the dependence between the math and philosophy results were perfect: if scores run from 0 to 10, assume that every student's philosophy score is 10 minus their math score.
In that case, if the classes score differently on math then they will also score differently on philosophy (because of the assumed dependence), so I only have to do the tests for math (and this is a family of three tests).

If the dependence is not perfect, you will have something in between.

But what holds is that, even in the case of dependence, the Bonferroni correction controls the FWER at the level $\alpha$; however, if there is dependence then this results in a loss of power (meaning that you will reject too few null hypotheses if you divide by six).
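The perfect-dependence case above can be checked directly (a sketch, not part of the original answer; the sample sizes and seed are arbitrary): when philosophy = 10 − math for every student, the two-sided t-test on philosophy returns exactly the same p-value as the one on math, so the six tests collapse to three distinct ones.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
math_a = rng.uniform(0, 10, size=30)       # class A math scores
math_b = rng.uniform(0, 10, size=30)       # class B math scores
phil_a, phil_b = 10 - math_a, 10 - math_b  # perfectly dependent phil scores

p_math = ttest_ind(math_a, math_b).pvalue
p_phil = ttest_ind(phil_a, phil_b).pvalue
print(p_math, p_phil)  # identical: dividing alpha by 6 rather than 3 is wasteful here
```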

  • I googled familywise error (I did not know this word). I do not think it is applied to my case. What I want to know is, for example, class A does better on math than B does. I will not conclude like class A is better at studying than B is because A got higher scores on at least one subject (either of the subjects or maybe both). Is this right? –  Sep 21 '16 at 00:01
  • Your case is surely about the familywise error rate (FWER), no doubt about that. Bonferroni is a special case of FWER control, and as I say in my answer, it is a very conservative one in many cases. I will edit my answer. –  Sep 21 '16 at 05:41
  • "However, according to your comment that I cited supra, you don't... (from your answer)" is the opposite. I have two questions. One is whether the classes perform differently on math (and hopefully which class does better). The other is whether the classes perform differently on philosophy. Then I have no reason to compare the means of math scores and the means of phil scores. I understand I compare the scores of math test with one another, meaning do t-test 3 times, for example, so FWER is applicable here. Then, I will do the same thing to phil scores but for answering a different question. –  Sep 21 '16 at 07:25
  • "If you want to check whether the three classes perform differently on math..." is what I want to do. Also do the same test using phil scores. I will never test if $\mu_{Ap} = \mu_{Bm}$ is true or not. So my null hypothesis is $\mu_{Am} = \mu_{Bm}= \mu_{Cm}$ and $\mu_{Ap} = \mu_{Bp}= \mu_{Cp}$ –  Sep 21 '16 at 07:36
  • Good point, but do you think that changes anything in the rest of the explanation? You would also (assuming independence) transform that into six tests? –  Sep 21 '16 at 08:55
  • This will be like I make a conclusion about math scores in one section and make a conclusion about phil scores in another section. So I do have 6 tests but it is a pair of 3 tests on math and 3 tests on phil. My opinion is using $\alpha/3$ instead of $\alpha/6$ should be fine since the conclusion on math and the one on phil are completely separated. I am questioning if $\alpha/3$ is still fine when there is a correlation between math scores and phil scores. –  Sep 21 '16 at 09:14
  • What is your thesis: that the classes are different in studying or that they differ on maths ? –  Sep 21 '16 at 09:52
  • I have two theses: they differ on math and they differ on phil. Whether both of them, either of them or neither of them are true does not matter. I want to discuss separately and never at the same time. –  Sep 21 '16 at 10:18
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/45695/discussion-between-nickel-and-fcop). –  Sep 21 '16 at 11:28
  • You say ''... should be fine since the conclusion on math and the one on phil are completely separated. I am questioning if $\alpha/3$ is still fine when there is a correlation between math scores and phil scores'', but if math and phil are completely separated, how can you then talk about correlation between them ? What exactly would you like to conclude from your data ? –  Sep 21 '16 at 16:11
  • Even though I mention the correlation, my null hypothesis is still $\mu_{Am}=\mu_{Bm}=\mu_{Cm}$ and $\mu_{Ap}=\mu_{Bp}=\mu_{Cp}$ but not $\mu_{Am}=\mu_{Bm}=\mu_{Cm}=\mu_{Ap}=\mu_{Bp}=\mu_{Cp}$. This is because it may be useful to know the correlation between math scores and phil scores, but it will NOT be helpful to know the means of different tests are different, such as $\mu_{Am}=\mu_{Bp}$. –  Sep 22 '16 at 00:20
  • So if your $H_0$ is that $\mu_{Am}=\mu_{Bm}=\mu_{Cm} \text{ AND } \mu_{Ap}=\mu_{Bp}=\mu_{Cp}$, then $H0$ is false when either $\mu_{Am} \ne \mu_{Bm}$ OR $\mu_{Am} \ne \mu_{Cm}$ OR $\mu_{Bm} \ne \mu_{Cm}$ OR $\mu_{Ap} \ne \mu_{Bp}$ OR $\mu_{Ap} \ne \mu_{Cp}$ OR $\mu_{Bp} \ne \mu_{Cp}$ ? –  Sep 22 '16 at 06:58
  • It should be like $H_{0m}: \mu_{Am} = \mu_{Bm} = \mu_{Cm}$ and $H_{0p}: \mu_{Ap} = \mu_{Bp} = \mu_{Cp}$ because I have two null hypotheses. –  Sep 22 '16 at 14:40
  • Yes, but how will you test $H_{0m}$ in practice, by splitting it into 3 tests? –  Sep 22 '16 at 14:45
  • @Nickel I edited my answer and changed $H_0: \mu_{Am}=\mu_{Bm}=\mu_{Cm} \text{ AND }\mu_{Ap}=\mu_{Bp}=\mu_{Cp}$ versus .... (the opposite), the change is the ''AND''; if you transform this to univariate tests then I think you will get six tests, so you divide by six, see my answer (at the end) for the possible consequences. In particular, if the tests are dependent then you may lose power, which means that you will reject too few $H_0$s. Also note that the Bonferroni correction guarantees FWER control, so it is at the level of the individual hypotheses –  Sep 23 '16 at 07:42

To me this sounds like a two-way ANOVA situation (if I understand your question correctly). You have class and subject as factors and score as the outcome (dependent) variable. Below is an example of what I would do, including an example dataset. The analysis has been done in R but the procedure is the same with any other software. However, note that if you are doing t-tests / ANOVA etc., your residuals have to have roughly equal variance (you can and should assess this graphically) and be approximately normally distributed.

Build the example dataset (since it uses the sample() function without a fixed seed, your values won't be the same):

scores <- data.frame(class=rep(c("A","B","C"), each=12),
                     subject=(rep(c("math","phil"), each=6, 6)),
                     score=sample(c(0:100), size=36, replace=T))

Look at data table (is this what you have in principle?):

> scores
   class subject score
1      A    math    40
2      A    math     8
3      A    math    27
4      A    math    94
5      A    math    31
6      A    math    19
7      A    phil    19
8      A    phil    69
9      A    phil    58
10     A    phil    87
11     A    phil    58
12     A    phil    53
13     B    math    61
14     B    math     1
15     B    math    42
16     B    math    55
17     B    math     8
18     B    math    76
19     B    phil    93
20     B    phil    41
21     B    phil    78
22     B    phil     3
23     B    phil    72
24     B    phil    72
25     C    math    94
26     C    math    98
27     C    math    58
28     C    math    52
29     C    math    78
30     C    math    61
31     C    phil     4
32     C    phil    41
33     C    phil    33
34     C    phil     5
35     C    phil    98
36     C    phil     7
37     A    math    40
38     A    math     8
39     A    math    27
40     A    math    94
41     A    math    31
42     A    math    19
43     A    phil    19
44     A    phil    69
45     A    phil    58
46     A    phil    87
47     A    phil    58
48     A    phil    53
49     B    math    61
50     B    math     1
51     B    math    42
52     B    math    55
53     B    math     8
54     B    math    76
55     B    phil    93
56     B    phil    41
57     B    phil    78
58     B    phil     3
59     B    phil    72
60     B    phil    72
61     C    math    94
62     C    math    98
63     C    math    58
64     C    math    52
65     C    math    78
66     C    math    61
67     C    phil     4
68     C    phil    41
69     C    phil    33
70     C    phil     5
71     C    phil    98
72     C    phil     7

Run the ANOVA with main effects and interaction (though what exactly you should specify depends on your question) and check the output:

fit <- aov(score ~ class + subject + class:subject, data=scores)
> summary(fit)

              Df Sum Sq Mean Sq F value  Pr(>F)    
class          2    367     183   0.239 0.78823    
subject        1      8       8   0.010 0.91904    
class:subject  2  15507    7753  10.091 0.00015 ***
Residuals     66  50712     768                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In this case, the interaction is significant and neither of the main effects is. You could follow up with multiple mean comparisons (e.g. Tukey's HSD test) for the significant interaction:

> TukeyHSD(fit, "class:subject")

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = score ~ class + subject + class:subject, data = scores)

$`class:subject`
                    diff         lwr       upr     p adj
B:math-A:math   4.000000 -29.2146663 37.214666 0.9992416
C:math-A:math  37.000000   3.7853337 70.214666 0.0203469
A:phil-A:math  20.833333 -12.3813330 54.048000 0.4472734
B:phil-A:math  23.333333  -9.8813330 56.548000 0.3198057
C:phil-A:math  -5.166667 -38.3813330 28.048000 0.9974039
C:math-B:math  33.000000  -0.2146663 66.214666 0.0524698
A:phil-B:math  16.833333 -16.3813330 50.048000 0.6733007
B:phil-B:math  19.333333 -13.8813330 52.548000 0.5312747
C:phil-B:math  -9.166667 -42.3813330 24.048000 0.9647034
A:phil-C:math -16.166667 -49.3813330 17.048000 0.7096646
B:phil-C:math -13.666667 -46.8813330 19.548000 0.8315925
C:phil-C:math -42.166667 -75.3813330 -8.952000 0.0052198
B:phil-A:phil   2.500000 -30.7146663 35.714666 0.9999243
C:phil-A:phil -26.000000 -59.2146663  7.214666 0.2097783
C:phil-B:phil -28.500000 -61.7146663  4.714666 0.1336128

This tells me that there is a significant difference in math scores between classes C and A, as well as a significant difference between math and philosophy scores in class C. Now it's getting interesting because you have to speculate why this may be the case :D

There is also plenty of information regarding these analyses here on Cross Validated, and you should have a look at it.

Stefan
  • If I want to focus on each subject (ignore the interaction) and compare A:math, B:math and C:math, and A:phil, B:phil and C:phil separately, can I use TukeyHSD twice? That is, I have two TukeyHSD tests. One considers math scores only and compare different classes, and the other one only considers philosophy scores. –  Sep 20 '16 at 03:58
  • @Stefan If one looks at main effect and the interaction, does this not require a multiple testing correction? – Björn Sep 20 '16 at 05:18
  • @Nickel If the interaction is significant, you shouldn't ignore it. Actually the interaction (in my example above) will answer the comparisons you specify in the comment. E.g. B:math - A:math (p=0.999); C:math - A:math (p=0.02) and C:math - B:math (p=0.05). However, you have to give people more information (data table) and ideally a reproducible example, otherwise it may turn into a guessing game. – Stefan Sep 20 '16 at 13:30
  • @Björn Not sure if I understand what you mean... If I do an ANOVA, I don't need to adjust unless I follow up with multiple mean comparisons after the ANOVA, no? Maybe you can expand a little more? – Stefan Sep 20 '16 at 13:33
  • @Stefan Sorry, I was confused. You are right, I need A:math-B:math, B:math-C:math and C:math-A:math (pairs of same subject and different classes), but I do not need the others A:phil-B:math, A:math-A:phil (trying to avoid comparisons between different subjects). I am afraid the comparisons I do not want to make decrease the statistical power to find the significance. Is it possible to choose what comparisons I make? I have a table. Want to show the results of statistical tests at the same time. –  Sep 20 '16 at 13:43
  • @Stefan Were you not looking at a p-value for both the main effect and the interaction (and then proceeding if either is significant)? If so, that would need a multiplicity adjustment (you are doing two tests), if you want to control the type I error rate. If you instead only look at the global null hypothesis, then yes, I agree you would not need an adjustment (but then need to figure out later whether it is the main effects or some interactions that matter, and for that you'd again need a multiplicity adjustment, if you want to control the type I error rate). – Björn Sep 20 '16 at 17:57
  • @Björn In my example the interaction is significant. Then I wanted to know where the differences in means are. I decided to do this with the Tukey procedure which adjusts for multiple mean comparisons, i.e. `TukeyHSD(fit, "class:subject")`. This shows me only the output for the interaction (to save space). In case you want to see the main effects too, you simply do `TukeyHSD(fit)` in this case. You are saying I need yet another adjustment? – Stefan Sep 20 '16 at 19:46
  • @Nickel No worries, it gets confusing very quickly. You can define specific comparisons before the analysis is done, i.e. a priori. If this is not the case, you cannot just pick whichever comparisons you like or else you would increase the chance of finding false positives (Type 1 error). Now, if you have significant main effect(s) and/or interaction, you can follow up with a post-hoc test to see where the differences in means are. Depending on your research objectives you can choose between a variety of tests, e.g. Tukey's procedure. – Stefan Sep 20 '16 at 19:51
  • @Stefan For example, the mean of the math scores was obviously very low compared to the mean of the phil scores. In this case, there is no point in testing means of tests on different subjects. Do I still have to include these comparisons in TukeyHSD? I simply want to test these subjects separately. –  Sep 21 '16 at 00:46
  • @Nickel Yes that's what I would do given the information I have from you. There might be other ways. – Stefan Sep 21 '16 at 04:14
  • @Björn That's interesting and makes sense, however I have never really come across this yet ... Do you have any references where this is mentioned and/or discussed? So say if you have a 2-way ANOVA and you look at the main effects and interactions, you use $\alpha/3$ to check whether or not they are significant? And only THEN proceed to multiple mean comparisons for those that are significant? – Stefan Sep 23 '16 at 23:41
  • @Stefan If you look in any reference about ANOVA, it will say that you look at the F-test for *the* (i.e. one single) global null hypothesis. If what comparisons you look at depends on what effects are significant that is not an obvious option any longer and what you describe with $\alpha/3$ should be okay (seems like it should be a valid closed testing procedure). – Björn Sep 24 '16 at 05:41

@Stefan If you do a single global test, ANOVA needs no adjustment. However, if you look at the p-values for the interaction and the two main effects (or just one of them), you are not doing a single global test, you are doing two or three or so (depending on exactly what you do). Obviously they are somewhat correlated, but they are not a single test.

Björn