It's all in the family; but do we include the in-laws too?

Question

Suppose I have an experiment with two or more factors. An overall ANOVA is constructed, and then we follow up with two or more sets of post hoc tests, say multiple comparisons. My question is about how big (and how many) families should be used as the basis for multiplicity adjustments of these post hoc tests.

An example is the warp-breaks dataset from Tukey's book on EDA. There are two factors: wool (at two levels) and tension (at three levels). The ANOVA table is:

Source       Df Sum Sq Mean Sq F value    Pr(>F)    
wool          1  450.7  450.67  3.7653 0.0582130
tension       2 2034.3 1017.13  8.4980 0.0006926
wool:tension  2 1002.8  501.39  4.1891 0.0210442
Residuals    48 5745.1  119.69

Clearly, the interaction is needed in the model. So we decide to do comparisons of the levels of each factor, holding the other factor fixed. The results are below, with some annotations to be referred to later:

*** Pairwise comparisons of tension for each wool ***
*** All combined: Family T ***

wool = A:   *** Family T|A ***
 contrast   estimate       SE df t.ratio
 L - M    20.5555556 5.157299 48   3.986
 L - H    20.0000000 5.157299 48   3.878
 M - H    -0.5555556 5.157299 48  -0.108

wool = B:   *** Family T|B ***
 contrast   estimate       SE df t.ratio
 L - M    -0.5555556 5.157299 48  -0.108
 L - H     9.4444444 5.157299 48   1.831
 M - H    10.0000000 5.157299 48   1.939


*** Comparison of wool for each tension ***
*** All combined: Family W ***

tension = L:   *** Family W|L ***
 contrast  estimate       SE df t.ratio
 A - B    16.333333 5.157299 48   3.167

tension = M:   *** Family W|M ***
 contrast  estimate       SE df t.ratio
 A - B    -4.777778 5.157299 48  -0.926

tension = H:   *** Family W|H ***
 contrast  estimate       SE df t.ratio
 A - B     5.777778 5.157299 48   1.120
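
As a sanity check on the output above, the common standard error and the t ratios can be reproduced from the ANOVA table alone. Here is a minimal sketch in Python, assuming a balanced design with 9 observations per cell (54 observations in all, consistent with the 53 total degrees of freedom); the variable names are only illustrative:

```python
import math

# Values copied from the ANOVA table and the contrast output above.
MSE = 119.69       # residual mean square
N_PER_CELL = 9     # assumption: balanced design, 54 observations / 6 cells

# SE of a difference between two cell means that share the pooled variance.
se_diff = math.sqrt(2 * MSE / N_PER_CELL)

# A few of the estimates, with the t ratios reported for them.
reported = {
    "L - M | wool A": (20.5555556, 3.986),
    "L - H | wool A": (20.0000000, 3.878),
    "A - B | tension L": (16.333333, 3.167),
}

# Recompute each t ratio as estimate / SE.
t_ratios = {name: est / se_diff for name, (est, _) in reported.items()}

print(f"SE of a difference: {se_diff:.4f}")
for name, (_, t_rep) in reported.items():
    print(f"{name}: t = {t_ratios[name]:.3f} (reported {t_rep})")
```

Every comparison has the same SE because every cell mean is based on the same number of observations and the same pooled error variance.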

I think there are different practices out there, and I wonder which are most common, and what arguments people would make for or against each approach. In computing adjusted $P$ values, should we do multiplicity adjustments for...

1. each of the five smallest families (T|A, T|B, ..., W|H) separately? (Note: the last 3 families have only one test each, so there would be no multiplicity adjustment for those.)
2. each of the larger families (T, with 6 tests, and W, with 3 tests) separately?
3. all $6+3=9$ tests considered as one big family?
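
To make the three options concrete, here is a rough sketch in Python of how the adjusted $P$ values would differ. It assumes a Bonferroni adjustment purely because it is easy to compute (a Tukey-type adjustment would be a more usual choice for pairwise comparisons), and it gets t-distribution tail areas by numerical integration so that no external library is needed. The function and variable names are made up for illustration.

```python
import math

DF = 48  # residual degrees of freedom from the ANOVA table

def two_sided_p(t, df, steps=2000):
    """Two-sided P value for a t statistic, via Simpson's rule on the t density."""
    a = abs(t)
    if a == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = a / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3  # integral of the density from 0 to |t|
    return max(0.0, 1.0 - 2.0 * area)

# (smallest family, larger family, t ratio) for each of the 9 tests above
tests = [
    ("T|A", "T", 3.986), ("T|A", "T", 3.878), ("T|A", "T", -0.108),
    ("T|B", "T", -0.108), ("T|B", "T", 1.831), ("T|B", "T", 1.939),
    ("W|L", "W", 3.167), ("W|M", "W", -0.926), ("W|H", "W", 1.120),
]

def bonferroni_by(key_index):
    """Bonferroni-adjust each test by the size of the family it belongs to.

    key_index 0 groups by smallest family, 1 by larger family,
    and None puts all tests into a single family.
    """
    def family(row):
        return "all" if key_index is None else row[key_index]

    sizes = {}
    for row in tests:
        sizes[family(row)] = sizes.get(family(row), 0) + 1
    return [min(1.0, sizes[family(row)] * two_sided_p(row[2], DF))
            for row in tests]

option1 = bonferroni_by(0)     # five small families
option2 = bonferroni_by(1)     # families T (6 tests) and W (3 tests)
option3 = bonferroni_by(None)  # one family of 9 tests

for row, p1, p2, p3 in zip(tests, option1, option2, option3):
    print(f"{row[0]:4s} t={row[2]:7.3f}  opt1={p1:.4f}  opt2={p2:.4f}  opt3={p3:.4f}")
```

Under Bonferroni the only thing that changes between options is the multiplier: 3 or 1 for option 1, 6 or 3 for option 2, and 9 throughout for option 3.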

I'm interested both in what people usually do (even if they haven't thought much about it) and why (if they have). A couple of things I might mention are:

There are 3 $F$ tests in the ANOVA table. I don't recall seeing anyone consider a multiplicity adjustment on ANOVA tests. If that's the case, and you recommend option (3), are you being inconsistent?
If we had done a somewhat smaller experiment where all the tests are less powerful, it's possible the interaction would not have been significant, leading to a much smaller number of post hoc comparisons of marginal means only. Moreover, the marginal means could well have smaller SEs than the cell means do in the larger experiment. If, in addition, the multiplicity adjustment is less conservative, we could have more "significant" results with less data than we'd have with more data.

Interested in seeing what people have to say...

score 2 · Answer 1 · edited Apr 13 '17 at 12:44

No one's answered yet, so I'll take a crack at this.

It's my opinion (and I would love to hear others' thoughts) that you should be adjusting for the full 9 tests in this case. Assuming we're using family-wise error rate correction:

We are drawing conclusions from all 9 tests at once, i.e., scanning down the list looking for anything significant.
To be able to do this, we are considering an overall family-wise error rate of 5%. The alternative would be to correct each group individually to a 5% FWER. This would mean that we could not interpret the tests together; we would instead have to look at the first 6 tests knowing that there's a 5% chance of a false positive among them, then examine each of the remaining groups in turn knowing that there is a 5% chance of a false positive for each group. IMO the utility of multiple testing correction is that we are able to draw inference from multiple tests simultaneously. It seems more logical that we should look at all 9 tests and know there's a 5% chance of a false positive, rather than having to examine them separately, which is akin to not correcting at all.
The issue of adjusting for the three $F$-tests in the ANOVA is interesting, but in my opinion only relevant if you plan to do some model selection in which you only accept significant predictors. This might be a good read; the conclusion in particular is succinct and excellent. I stole that link from this question.
Your point about the inclusion of interaction effects is interesting, and I think you could define that as model selection. Would you have included the interaction effects if they were not significant? In this case perhaps the $F$ statistics in the original ANOVA should have been adjusted in order to facilitate selection of significant predictors.
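
To show what adjusting the three $F$ tests might look like, here is a small Python sketch that applies Bonferroni and Holm adjustments to the $P$ values from the ANOVA table. These two methods are chosen only for simplicity; nothing in the question prescribes them, and the helper names are illustrative.

```python
# P values copied from the ANOVA table in the question.
anova_p = {"wool": 0.0582130, "tension": 0.0006926, "wool:tension": 0.0210442}

def bonferroni(pvals):
    """Multiply every P value by the number of tests, capped at 1."""
    m = len(pvals)
    return {term: min(1.0, m * p) for term, p in pvals.items()}

def holm(pvals):
    """Holm step-down: multipliers m, m-1, ..., 1 on the sorted P values,
    with a running maximum to keep the adjusted values monotone."""
    m = len(pvals)
    ordered = sorted(pvals.items(), key=lambda kv: kv[1])
    adjusted, running_max = {}, 0.0
    for rank, (term, p) in enumerate(ordered):
        running_max = max(running_max, min(1.0, (m - rank) * p))
        adjusted[term] = running_max
    return adjusted

bonf = bonferroni(anova_p)
holm_adj = holm(anova_p)

for term, p in anova_p.items():
    print(f"{term:13s} raw={p:.7f}  bonferroni={bonf[term]:.7f}  holm={holm_adj[term]:.7f}")
```

With Bonferroni the interaction $P$ value becomes $3 \times 0.0210442 \approx 0.063$, while with Holm it is $2 \times 0.0210442 \approx 0.042$, so whether the interaction "survives" at the 5% level depends on which adjustment is used.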

Overall I think that if you are drawing simultaneous inference from a group, you must consider each test in that group for correction. Otherwise the standard understanding of controlled group error rate doesn't hold up, and it's quite difficult to conceptually keep track of what has been adjusted and what hasn't. Much better, in my opinion, to hold all tests accountable and hold the family-wise error rate at a given threshold.
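
To put rough numbers on this, the sketch below (Python) computes how the chance of at least one false positive grows when $k$ families are each controlled at 5% separately. It assumes the families are independent, which is not true here (all nine tests share the same error mean square), so treat the first figure as illustrative only; the union bound $k\alpha$ holds regardless of dependence.

```python
ALPHA = 0.05  # per-family error rate

def fwer_independent(k, alpha=ALPHA):
    """Chance of at least one false positive across k independent families,
    each controlled at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k

def fwer_union_bound(k, alpha=ALPHA):
    """Upper bound on the overall error rate, valid under any dependence."""
    return min(1.0, k * alpha)

options = [
    ("option 1 (5 families)", 5),
    ("option 2 (2 families)", 2),
    ("option 3 (1 family)", 1),
]

for label, k in options:
    print(f"{label}: independent = {fwer_independent(k):.4f}, "
          f"bound = {fwer_union_bound(k):.2f}")
```

Only with a single family does the overall rate stay at the nominal 5%.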

If you have any rebuttals, I would love to hear them, and I'm sure some people will disagree with some things in here. Very interested to hear others' thoughts.

Thanks. Well thought-out. Side question: is it possible to get SAS to do this? I don't think so but there's a lot I don't know about SAS. It's relevant because I think this type of adjustment is seldom used in practice. — Russ Lenth, Aug 14 '15 at 13:47
Unfortunately I don't know that much about SAS, sorry @rvl. Maybe someone else will see this and help out. I hope you get some more people chiming in for this issue, it's a very good question that people don't really think about that often. — Chris C, Aug 14 '15 at 13:51
that's fine - I was just musing about what is actually possible to do easily with existing software. If the consensus comes down to option 3, we need software support for it! — Russ Lenth, Aug 14 '15 at 19:20
... but now it can be done in R. See the new answer I posted in the related question, http://stats.stackexchange.com/questions/165125/lsmeans-r-adjust-for-multiple-comparisons-with-interaction-terms/167228#167228. That question is what got me thinking about this. — Russ Lenth, Aug 15 '15 at 01:06
Very cool! Are you the maintainer of `lsmeans`? That was a lot of work for that question! — Chris C, Aug 15 '15 at 03:32
Yes I am. I figured it was worth doing that to provide for different multiplicity adjustments — Russ Lenth, Aug 15 '15 at 12:57
