I have multiple groups of data (20+ groups) and test the null hypothesis that they have the same mean. How do I proceed after the null hypothesis is rejected? What is the standard method for selecting the groups that do have the same mean?
The common wisdom says that in this case one has to do pairwise t-tests. It is hard to believe that nothing more efficient has been invented. In addition, pairwise testing has many deficiencies, e.g.:
- There is no clear algorithm/protocol for separating the groups that may have the same mean from those that are off. (Visual inspection may do for one experiment, but what if you have to repeat this analysis hundreds of times with different data?)
- Even with such a protocol, selecting groups that likely have the same mean on the basis of pairwise t-tests will lead to accumulation of errors.
On the other hand, pairwise testing has the advantage that we have an efficient test for confirming the null hypothesis for two groups: the TOST equivalence test.
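For concreteness, here is a minimal sketch of TOST for two independent samples (a Welch-type version; the function name `tost_ind` and the margin parameter `delta` are my own choices, not a standard API):

```python
import numpy as np
from scipy import stats

def tost_ind(x, y, delta):
    """Two one-sided tests (TOST) for mean equivalence, Welch version.
    H0: |mu_x - mu_y| >= delta; a small returned p-value supports
    equivalence of the means within the margin +/- delta."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    se = np.sqrt(vx + vy)
    # Welch–Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    d = x.mean() - y.mean()
    p_lower = stats.t.sf((d + delta) / se, df)   # H0: mu_x - mu_y <= -delta
    p_upper = stats.t.cdf((d - delta) / se, df)  # H0: mu_x - mu_y >= +delta
    return max(p_lower, p_upper)                 # overall TOST p-value
```

With two large samples from the same distribution and a margin such as `delta = 0.5`, the returned p-value is small, supporting equivalence; the margin itself has to be chosen on subject-matter grounds, which is part of why the multi-group problem is not trivial.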
What are we looking for?
In my opinion, meaningful answers to this question could include:
- A method for separating the groups that may have a common mean from the rest. If the method is based on pairwise comparisons (rather than batch testing), it probably includes some clustering procedure.
- Equivalence test for the groups with common mean, i.e. a test for affirming the null hypothesis rather than just not rejecting it.
- Equivalent Bayesian procedures for the above tasks.
- Other suggestions about the things that could be done in ANOVA setting after rejecting the null hypothesis (perhaps looking for the subset of groups with a common mean is not the only interesting goal.)
Important
- References to textbooks and research papers are encouraged.
- Math and algorithms are better than code.
- Preference is for scalable methods, i.e. methods that can be used with a large number of groups and run many times (e.g. visual inspection won't do).
Related questions
What do "single-step" and "multi-step" mean in post-hoc testing of ANOVAs?
Two-sample $t$-test vs Tukey's method
Why are there not obvious improvements over Tukey's method?
Remark about pairwise comparisons
Since multiple users (correctly) noted that making pairwise comparisons for 20 groups is not that computationally hard (it involves ${k \choose 2} = k(k-1)/2$ comparisons with $k=20$, i.e. 190 pairs), I would like to explain again why this does not solve the problem (apart from the obvious intellectual curiosity about dealing with $k = 200$, $2000$, etc.).
Suppose you have done all pairwise comparisons (t-tests and equivalence tests), and now you are staring at a 20-by-20 table of p-values. What do you do next? In some cases the outlying groups are obvious - but those are the cases where we do not really need statistics.
In the general case you need an algorithm for selecting the groups with a common mean. Moreover, it is unlikely that you can do it on the basis of pairwise comparisons only: group A may not be statistically different from B and C, yet B and C may differ from each other.
The brute-force approach would be to run an ANOVA F-test on every possible combination of $m$ groups, where $m$ runs from 2 to $k-1$. This leaves us with \begin{equation} \sum_{m=2}^{k-1} {k \choose m} \sim 2^k \end{equation} tests, which looks far more daunting than the ${k \choose 2}$ pairwise tests. It is of course an exaggeration, but it sets an upper bound and hopefully clarifies why I suspect the existence of a standard method for finding the subset of groups with a common mean.
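The counts follow from the binomial theorem: the sum of ${k \choose m}$ over all $m$ is $2^k$, and dropping the trivial sizes $m = 0, 1, k$ leaves $2^k - k - 2$. A quick numerical check of the gap at $k = 20$:

```python
import math

k = 20
pairwise = math.comb(k, 2)                           # one test per unordered pair
subsets = sum(math.comb(k, m) for m in range(2, k))  # F-test on every subset of size 2..k-1
print(pairwise)   # 190
print(subsets)    # 2**k - k - 2 = 1048554
```

So already at $k=20$ the exhaustive search is four orders of magnitude larger than the pairwise scan, and it doubles with every additional group.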
Multiple comparisons
@SalMangiafico has brought to my attention the compact letter display approach. It seems to be more of a visualization technique than an inference technique, yet it led me to this article, which contains some relevant references:
- Hochberg, Y., Tamhane, A. C., 1987. Multiple Comparison Procedures. Wiley.
- Hsu, J. C., 1996. Multiple Comparisons: Theory and Methods. Chapman & Hall.
- Shaffer, J. P., 1995. Multiple hypothesis testing. Annual Review of Psychology 46, 561–584.
I am not yet sure whether these books contain the answers to my questions: it seems that the field deals more with the errors introduced by multiple comparisons than with the questions I raised above. If somebody knowledgeable could write an answer here or recommend a brief review of the subject, it would be greatly appreciated.