
From answers on this site I have learned that it is possible to get significant results from multiple comparison ("post hoc") tests even when the ANOVA did not yield a significant result. I believe the opposite is also possible: a significant ANOVA result with non-significant post hoc results.

I am not sure that such a general question can be answered, but I would still like to ask it: which is, generally speaking, more powerful, an ANOVA test or post hoc tests on the same data?

Sam
  • They test different hypotheses. – AdamO Apr 01 '21 at 15:32
  • I agree, but I think that nevertheless one can still ask which is more powerful. – Sam Apr 04 '21 at 08:31
  • @Sam No, you cannot meaningfully ask which is more powerful, since the definition of **power** is $P(\text{Reject H}_{0} \mid \text{H}_{0}\ \text{false})$. The "$\mid \text{H}_{0}\ \text{false}$" part means the ANOVA power ≠ the *post hoc* pairwise test power in the same way that apple ≠ orange. – Alexis Apr 06 '21 at 00:34
  • @Alexis The question most certainly is meaningful, indeed it is quite standard. My answer details exactly how the comparison is made and how the posthoc t-tests test (amongst other things) the same null hypothesis as the F-test. – Gordon Smyth Apr 06 '21 at 01:32
  • @AdamO If there are $G$ groups and $G-1$ linearly independent contrasts are tested then the posthoc tests do test (amongst other things) the overall F-test null hypothesis of no group differences. – Gordon Smyth Apr 06 '21 at 01:39
  • @GordonSmyth Your rather belabored answer (I downvoted) mixes the issues of Reject ANOVA but Not Reject any *post hoc* tests and vice versa (a question which has been asked and answered repeatedly on this site, BTW) with some generalized notion of which test is "more powerful". ANOVA and *post hoc* tests have different null and alternative hypotheses. Since **power** is literally and explicitly defined with respect to a specific null hypothesis and set of alternative hypotheses, you are comparing apples to oranges. – Alexis Apr 06 '21 at 15:21
  • @Alexis I have not mixed issues. Yes, of course power is defined in terms of a specified null and alternative. We all understand that. What you have failed to understand is that a set of posthoc hypotheses that span the space of all contrasts combine to match the null and alternative of the F-test. The F-test null is equal to the intersection of the posthoc nulls, and the F-test alternative is equal to the union of the posthoc alternatives. It is most certainly possible to conduct a complete analysis of a oneway anova without conducting the F-test. – Gordon Smyth Apr 06 '21 at 21:27
  • @GordonSmyth You have lost me on where you are contrasting "power of ANOVA" with "not conducting the F-test," and I think you are contorting the meaning of "power of a test." (Aside: why would you use Holm's method rather than the Benjamini-Hochberg FDR adjustment? The latter is strictly more powerful than the former, is adaptive, and scales.) – Alexis Apr 06 '21 at 23:16
  • @Alexis I am not changing or contorting the meaning of statistical power in any way. There is only one definition of statistical power. As for not doing the F-test: we are comparing two procedures, one is to conduct the F-test and not the t-tests, the second is to conduct a series of anova t-tests but not the F-test. – Gordon Smyth Apr 07 '21 at 04:51
  • @Alexis The fact that the minimum of a number of adjusted p-values is a valid p-value for testing the intersection of null hypotheses is pretty much the definition of (weak) familywise error rate control. Yes, I could have suggested a multiple testing procedure that only offers weak familywise error rate control, such as Simes' method. That would be more powerful than Holm's for the overall null hypothesis but would not then offer strong error rate control for the individual tests. – Gordon Smyth Apr 07 '21 at 05:01
  • @Alexis There are other more powerful and more specialized multiple testing procedures for the oneway layout (discussed in Peter Westfall's (aka BigBendRegion) book), but I answered the question in terms of Holm's to keep it simple. Holm's method is the simplest and most flexible method that offers strong familywise error rate control. – Gordon Smyth Apr 07 '21 at 05:05
  • @GordonSmyth Thank you for entertaining my question about Holm's method. – Alexis Apr 07 '21 at 16:17

1 Answer


To make anova and posthoc tests comparable, you need to conduct posthoc tests for at least $G-1$ contrasts (where $G$ is the number of groups) chosen so that the contrasts span the space of all possible contrasts. In that case, accepting all the contrast null hypotheses implies that the true group means are all equal, which is equivalent to the anova F-test null hypothesis.
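For example, with $G = 3$ the two contrasts $\mu_1 - \mu_2$ and $\mu_2 - \mu_3$ are linearly independent and span the contrast space: if both are zero then $\mu_1 = \mu_2 = \mu_3$, which is exactly the F-test null hypothesis.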

Given this assumption, which is more powerful depends on the configuration of true group means and on the specific post-hoc tests you plan to conduct.

Let's say you are doing a oneway anova with $G$ groups. While there are many ways to conduct post hoc tests, the following strategies are amongst the most common choices:

  1. Perform an overall F-test for differences between the group means.

  2. Perform t-tests for all possible pairwise comparisons. Adjust the p-values for multiple testing using Holm's method.

  3. Choose $G-1$ linearly independent contrasts that correspond to your scientific hypotheses. Conduct a t-test for each contrast and adjust the p-values using Holm's method.

Consider the overall null hypothesis that the true group means are all equal. In approach 1, the null hypothesis is rejected if the F-test p-value is less than $\alpha$, where $\alpha$ is the significance level. In approaches 2 or 3, the null hypothesis of no differences is rejected if any of the adjusted p-values are less than $\alpha$. All three approaches control the type I error rate for this test at $\alpha$.
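To make the three approaches concrete, here is a minimal sketch in Python (not from the original answer; the data, contrasts and significance level are illustrative, and it assumes NumPy, SciPy and statsmodels are available):

```python
import numpy as np
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
G, n, alpha = 4, 12, 0.05
# hypothetical data: four groups, the last with a shifted mean
groups = [rng.normal(loc=mu, scale=1.0, size=n) for mu in (0.0, 0.0, 0.0, 0.9)]

# Approach 1: overall F-test
print("F-test rejects:", stats.f_oneway(*groups).pvalue < alpha)

# Approach 2: all pairwise t-tests, Holm-adjusted; the overall null is
# rejected if any adjusted p-value falls below alpha
p_pair = [stats.ttest_ind(groups[i], groups[j]).pvalue
          for i, j in combinations(range(G), 2)]
reject_pair = multipletests(p_pair, alpha=alpha, method="holm")[0]
print("pairwise t-tests reject overall null:", reject_pair.any())

# Approach 3: G-1 pre-chosen contrasts (Helmert-style here, purely
# illustrative), tested with t-statistics built from the pooled variance
C = np.array([[1, -1,  0,  0],
              [1,  1, -2,  0],
              [1,  1,  1, -3]], float)
ybar = np.array([g.mean() for g in groups])
mse = np.mean([g.var(ddof=1) for g in groups])   # pooled MSE (balanced design)
df = G * (n - 1)
se = np.sqrt(mse * (C**2).sum(axis=1) / n)       # SE of each contrast estimate
p_con = 2 * stats.t.sf(np.abs(C @ ybar / se), df)
reject_con = multipletests(p_con, alpha=alpha, method="holm")[0]
print("contrast t-tests reject overall null:", reject_con.any())
```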

Approaches 2 and 3 test a number of null hypotheses besides the overall null of no differences, but I am answering your question in terms of the overall null. For approaches 2 and 3, the minimum adjusted p-value is an effective test of the overall null because the intersection of the null hypotheses for the individual t-tests is equal to the overall null hypothesis and the union of the t-test alternative hypotheses is equal to the F-test alternative.

The F-statistic can be written as a weighted average of the squared t-statistics from approaches 2 or 3. Hence it works best when all or most of the t-tests contribute meaningfully to the average. For the t-test approaches, the result is driven mainly or entirely by the largest t-statistic. In general, the F-statistic will give a smaller p-value if the individual t-statistics are all similar in size, whereas approach 3 will give a smaller p-value if one of the t-statistics is much larger than the others in absolute size.
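In the balanced case, with the contrasts rescaled to be orthonormal, the weights are all equal and F is exactly the plain mean of the squared t-statistics. A quick numerical check of the identity (my sketch, not part of the original answer):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
G, n = 4, 10
y = rng.normal(loc=[0.0, 0.4, 0.8, 1.2], size=(n, G))  # balanced oneway layout

# G-1 Helmert contrasts, rescaled so each contrast estimate has variance MSE
H = np.array([[1, -1,  0,  0],
              [1,  1, -2,  0],
              [1,  1,  1, -3]], float)
C = H * np.sqrt(n) / np.linalg.norm(H, axis=1, keepdims=True)

ybar = y.mean(axis=0)
mse = y.var(axis=0, ddof=1).mean()   # pooled residual variance (balanced)
t = C @ ybar / np.sqrt(mse)          # one t-statistic per contrast

F = stats.f_oneway(*y.T).statistic
print(F, (t**2).mean())              # the two agree to rounding error
```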

The F-test is more powerful than the t-tests if the true group means are equally spaced. The contrast t-tests will be more powerful if one or more of the contrasts match the true pattern of group differences. The pairwise t-tests are generally less powerful than the contrast t-tests (because $G(G-1)/2$ tests are conducted instead of $G-1$) but may still be more powerful than the F-test if one of the pairwise differences is much larger than the others.
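If you want to check these power comparisons for your own design, a rough Monte Carlo sketch like the following will do (my illustration, not part of the original answer; the effect sizes and sample sizes are arbitrary):

```python
import numpy as np
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

def estimate_power(mus, n=10, nsim=2000, alpha=0.05, seed=0):
    """Monte Carlo rejection rates of the overall null for the
    F-test (approach 1) and Holm-adjusted pairwise t-tests (approach 2)."""
    rng = np.random.default_rng(seed)
    pairs = list(combinations(range(len(mus)), 2))
    hits_f = hits_t = 0
    for _ in range(nsim):
        y = [rng.normal(mu, 1.0, n) for mu in mus]
        hits_f += stats.f_oneway(*y).pvalue < alpha
        p = [stats.ttest_ind(y[i], y[j]).pvalue for i, j in pairs]
        hits_t += multipletests(p, alpha=alpha, method="holm")[0].any()
    return hits_f / nsim, hits_t / nsim

# compare the two configurations discussed above
print("equally spaced means:", estimate_power([0.0, 0.3, 0.6, 0.9]))
print("one outlying group  :", estimate_power([0.0, 0.0, 0.0, 1.0]))
```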

One way to choose the contrasts is to test each group mean against the average of the other group means. This choice is more powerful than the other approaches for detecting a single group that differs from all the others.
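For concreteness, those contrasts can be written as the rows of a simple matrix (a sketch of my own, with a hypothetical $G = 4$):

```python
import numpy as np

G = 4  # hypothetical number of groups
# row g: group g versus the average of the other G-1 groups
C = np.eye(G) * G / (G - 1) - np.ones((G, G)) / (G - 1)
# the G rows sum to the zero vector, so only G-1 of them are linearly
# independent; dropping any one row still leaves a spanning set of contrasts
```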

After many years of working in biomedical research, I find that I use the F-test less and less and approach 3 more and more. The main trouble with the F-test in practice is that, when the null hypothesis is rejected, it gives no guidance as to which group means are different. So one has to conduct the t-tests anyway, in which case any disagreement between the t-tests and the F-test becomes an interpretation problem. Assuming that the contrasts are appropriately chosen, approach 3 is statistically more powerful as well as simpler to interpret. Approaches 2 and 3 also have the advantage over the F-test of strong familywise error rate control over all the tests conducted.

If the anova is balanced (equal numbers in each group) then one can also consider Tukey's honestly significant difference (HSD) procedure, which is similar to making all possible pairwise comparisons but more powerful because Tukey's method accounts for the dependencies between the pairwise comparisons. My remarks above apply to any anova, balanced or otherwise.
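In Python, one common implementation is statsmodels' `pairwise_tukeyhsd` (the usage below is illustrative; the data are simulated):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(3)
G, n = 4, 10
y = np.concatenate([rng.normal(mu, 1.0, n) for mu in (0.0, 0.0, 0.5, 1.0)])
labels = np.repeat([f"group{g}" for g in range(G)], n)

# all pairwise comparisons with Tukey's HSD adjustment
print(pairwise_tukeyhsd(y, labels, alpha=0.05))
```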

Gordon Smyth
  • Balance is not required. Procedures that account for dependencies exactly in unbalanced situations, for any set of contrasts, have been standard for years; see e.g. the 2010 book "Multiple Comparisons Using R" and references therein. – BigBendRegion Apr 06 '21 at 01:51