Suppose I have an experiment with two or more factors. An overall ANOVA is constructed, and then we follow up with two or more sets of post hoc tests, say multiple comparisons. My question is about how large the families should be, and how many of them there should be, when they serve as the basis for multiplicity adjustments of these post hoc tests.
An example is the warp-breaks dataset from Tukey's book on EDA. There are two factors: wool (at two levels) and tension (at three levels). The ANOVA table is:

    Source        Df  Sum Sq  Mean Sq  F value     Pr(>F)
    wool           1   450.7   450.67   3.7653  0.0582130
    tension        2  2034.3  1017.13   8.4980  0.0006926
    wool:tension   2  1002.8   501.39   4.1891  0.0210442
    Residuals     48  5745.1   119.69

Clearly, the interaction is needed in the model. So we decide to do comparisons of the levels of each factor, holding the other factor fixed. The results are below, with some annotations to be referred to later:

    *** Pairwise comparisons of tension for each wool ***
    *** All combined: Family T ***

    wool = A:    *** Family T|A ***
     contrast    estimate       SE df t.ratio
     L - M     20.5555556 5.157299 48   3.986
     L - H     20.0000000 5.157299 48   3.878
     M - H     -0.5555556 5.157299 48  -0.108

    wool = B:    *** Family T|B ***
     contrast    estimate       SE df t.ratio
     L - M     -0.5555556 5.157299 48  -0.108
     L - H      9.4444444 5.157299 48   1.831
     M - H     10.0000000 5.157299 48   1.939

    *** Comparison of wool for each tension ***
    *** All combined: Family W ***

    tension = L:    *** Family W|L ***
     contrast   estimate       SE df t.ratio
     A - B     16.333333 5.157299 48   3.167

    tension = M:    *** Family W|M ***
     contrast   estimate       SE df t.ratio
     A - B     -4.777778 5.157299 48  -0.926

    tension = H:    *** Family W|H ***
     contrast   estimate       SE df t.ratio
     A - B      5.777778 5.157299 48   1.120

I think there are different practices out there, and I wonder which are most common, and what arguments people would make for or against each approach. In computing adjusted $P$ values, should we do multiplicity adjustments for...
1. each of the five smallest families (T|A, T|B, ..., W|H) separately? (Note: the last three families have only one test each, so there would be no multiplicity adjustment for those.)
2. each of the larger families (T, with 6 tests, and W, with 3 tests) separately?
3. all $6+3=9$ tests considered as one big family?
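
To make the three choices concrete, here is a minimal sketch that applies each of them to the t ratios shown above. Two things in it are my assumptions, not part of the analysis itself: it uses a plain Bonferroni adjustment (a Tukey or Sidak adjustment would give different numbers), and it gets two-sided P values by numerically integrating the t density with 48 df so that only the Python standard library is needed. The helper names are made up for illustration.

```python
import math

DF = 48  # residual degrees of freedom from the ANOVA table

# t ratios copied from the output above, keyed by smallest family
T_RATIOS = {
    "T|A": [3.986, 3.878, -0.108],
    "T|B": [-0.108, 1.831, 1.939],
    "W|L": [3.167],
    "W|M": [-0.926],
    "W|H": [1.120],
}

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided P value via Simpson's rule on the density over [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)

def bonferroni(families):
    """Assumed adjustment: p_adj = min(1, m * p), m = tests in the family."""
    out = {}
    for members in families.values():
        m = sum(len(T_RATIOS[k]) for k in members)
        for k in members:
            for i, t in enumerate(T_RATIOS[k]):
                out[(k, i)] = min(1.0, m * two_sided_p(t, DF))
    return out

# The three ways of grouping the nine tests into families
OPTIONS = {
    "five small families": {k: [k] for k in T_RATIOS},
    "two families T and W": {"T": ["T|A", "T|B"], "W": ["W|L", "W|M", "W|H"]},
    "one family of 9": {"all": list(T_RATIOS)},
}

adjusted = {label: bonferroni(fams) for label, fams in OPTIONS.items()}

for label, adj in adjusted.items():
    print(label)
    for (family, i), p in adj.items():
        print(f"  {family} test {i + 1}: adjusted P = {p:.4f}")
```

Under Bonferroni the adjusted P value of any given comparison can only grow as its family grows, so the three options are ordered from least to most conservative. For example, the single W|L comparison is multiplied by 1, 3, and 9 respectively.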
I'm interested both in what people usually do (even if they haven't thought much about it) and why (if they have). A couple of things I might mention are:
- There are 3 $F$ tests in the ANOVA table. I don't recall seeing anyone consider a multiplicity adjustment on ANOVA tests. If that's the case, and you recommend option (3), are you being inconsistent?
- If we had done a somewhat smaller experiment where all the tests are less powerful, it's possible the interaction would not have been significant, leading to a much smaller number of post hoc comparisons of marginal means only. Moreover, the marginal means could well have smaller SEs than the cell means do in the larger experiment. If, in addition, the multiplicity adjustment is less conservative, we could have more "significant" results with less data than we'd have with more data.
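
On the first point, it is easy to see what treating the three $F$ tests as a family would do. This is only a sketch, and it assumes a Bonferroni adjustment, which is my choice for illustration and not something anyone has proposed here; the P values are copied from the ANOVA table.

```python
# P values copied from the ANOVA table above
anova_p = {"wool": 0.0582130, "tension": 0.0006926, "wool:tension": 0.0210442}

m = len(anova_p)  # three F tests treated as one family (assumption)
anova_adj = {term: min(1.0, m * p) for term, p in anova_p.items()}

for term, p in anova_adj.items():
    print(f"{term}: {p:.4f}")
# wool: 0.1746
# tension: 0.0021
# wool:tension: 0.0631
```

With that adjustment the interaction would no longer fall below 0.05, which bears directly on the decision to compare levels of one factor within levels of the other.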
Interested in seeing what people have to say...