I have multiple groups of data (20+ groups) and test the null hypothesis that they have the same mean. How do I proceed after the null hypothesis is rejected? What is the standard method for selecting the groups that do have the same mean?
The common wisdom says that in this case one has to do pairwise t-tests. It is hard to believe that nothing more efficient has been invented. In addition, pairwise testing has many deficiencies, e.g.:
- There is no clear algorithm/protocol for separating the groups that may have the same mean from those that are off. (Visual inspection may do for one experiment, but what if you have to repeat this analysis hundreds of times with different data?)
- Even with such a protocol, selecting groups that likely have the same mean on the basis of pairwise t-tests will lead to accumulation of errors.
On the other hand, pairwise testing has the advantage that we have an efficient test for confirming the null hypothesis for two groups: the TOST equivalence test.
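For concreteness, here is a minimal sketch of TOST for two independent samples (a Welch-type version; the function name `tost_ind` and the margin parameter `delta` are my own choices, not a standard API):

```python
import numpy as np
from scipy import stats

def tost_ind(x, y, delta):
    """Two one-sided tests (TOST) for mean equivalence, Welch version.
    H0: |mu_x - mu_y| >= delta; a small returned p-value supports
    equivalence of the means within the margin +/- delta."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    # Welch–Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    d = x.mean() - y.mean()
    p_lower = stats.t.sf((d + delta) / se, df)   # H0: mu_x - mu_y <= -delta
    p_upper = stats.t.cdf((d - delta) / se, df)  # H0: mu_x - mu_y >= +delta
    return max(p_lower, p_upper)                 # overall TOST p-value
```

With two large samples from the same distribution and a margin such as `delta = 0.5`, the returned p-value is small, supporting equivalence; the margin itself has to be chosen on subject-matter grounds, which is part of why the multi-group problem is not trivial.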
What are we looking for?
In my opinion, meaningful answers to this question could include:
- A method for separating the groups that may have a common mean from the rest. If the method is based on pairwise comparisons (rather than batch testing), it probably includes some clustering procedure.
- Equivalence test for the groups with common mean, i.e. a test for affirming the null hypothesis rather than just not rejecting it.
- Equivalent Bayesian procedures for the above tasks.
- Other suggestions about the things that could be done in ANOVA setting after rejecting the null hypothesis (perhaps looking for the subset of groups with a common mean is not the only interesting goal.)
Important
- References to textbooks and research papers are encouraged.
- Math and algorithms are better than code.
- Preference is for scalable methods, i.e. methods that can be used with a large number of groups and run many times (e.g. visual inspection won't do).
Related questions
What do "single-step" and "multi-step" mean in post-hoc testing of ANOVAs?
Two-sample $t$-test vs Tukey's method
Why are there not obvious improvements over Tukey's method?
Remark about pairwise comparisons
Since multiple users (correctly) noted that making pairwise comparisons for 20 groups is not that computationally hard (it involves ${k \choose 2} = k(k-1)/2$ comparisons with $k=20$, i.e. 190 pairs), I would like to explain again why this does not solve the problem (apart from the obvious intellectual curiosity about dealing with $k = 200$, $2000$, etc.).
Suppose you have done all pairwise comparisons (t-tests and equivalence tests), and now you are staring at a 20-by-20 table of p-values. What do you do next? In some cases the outlying groups are obvious - but those are the cases where we do not really need statistics.
In the general case you need an algorithm for selecting the groups with a common mean. Moreover, it is unlikely that you can do it on the basis of pairwise comparisons only: group A may not be statistically different from B and C, yet B and C may differ from each other.
The brute-force approach would be to run an ANOVA F-test on every possible combination of $m$ groups, where $m$ runs from 2 to $k-1$. This leaves us with \begin{equation} \sum_{m=2}^{k-1} {k \choose m} \sim 2^k \end{equation} tests, which looks far more daunting than the ${k \choose 2}$ pairwise tests. It is of course an exaggeration, but it sets an upper bound and hopefully clarifies why I suspect the existence of a standard method for finding the subset of groups with a common mean.
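The counts follow from the binomial theorem: the sum of ${k \choose m}$ over all $m$ is $2^k$, and dropping the trivial sizes $m = 0, 1, k$ leaves $2^k - k - 2$. A quick numerical check of the gap at $k = 20$:

```python
import math

k = 20
pairwise = math.comb(k, 2)                           # one test per unordered pair
subsets = sum(math.comb(k, m) for m in range(2, k))  # F-test on every subset of size 2..k-1
print(pairwise)   # 190
print(subsets)    # 2**k - k - 2 = 1048554
```

So already at $k=20$ the exhaustive search is four orders of magnitude larger than the pairwise scan, and it doubles with every additional group.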
Multiple comparisons
@SalMangiafico has brought to my attention the compact letter display approach. It seems to be more of a visualization technique than an inference technique, yet it led me to this article, which contains some relevant references:
- Hochberg, Y., Tamhane, A. C., 1987. Multiple Comparison Procedures. Wiley.
- Hsu, J. C., 1996. Multiple Comparisons: Theory and Methods. Chapman & Hall.
- Shaffer, J. P., 1995. Multiple hypothesis testing. Annual Review of Psychology 46, 561–584.
I am not yet sure whether these books contain the answers to my questions: it seems that the field deals more with the errors introduced by multiple comparisons than with the questions I raised above. If somebody knowledgeable could write an answer here or recommend a brief review of the subject, it would be greatly appreciated.