32

Is it possible for one-way (with $N>2$ groups, or "levels") ANOVA to report a significant difference when none of the $N(N-1)/2$ pairwise t-tests does?

In this answer @whuber wrote:

It is well known that a global ANOVA F test can detect a difference of means even in cases where no individual [unadjusted pairwise] t-test of any of the pairs of means will yield a significant result.

so apparently it is possible, but I do not understand how. When does it happen, and what would the intuition behind such a case be? Maybe somebody can provide a simple toy example of such a situation?

Some further remarks:

  1. The opposite is clearly possible: overall ANOVA can be non-significant while some of the pairwise t-tests erroneously report significant differences (i.e. those would be false positives).

  2. My question is about standard, non-adjusted for multiple comparisons t-tests. If adjusted tests are used (like e.g. Tukey's HSD procedure), then it is possible that none of them turns out to be significant even though the overall ANOVA is. This is covered here in several questions, e.g. How can I get a significant overall ANOVA but no significant pairwise differences with Tukey's procedure? and Significant ANOVA interaction but non-significant pairwise comparisons.

  3. Update. My question originally referred to the usual two-sample pairwise t-tests. However, as @whuber pointed out in the comments, in the ANOVA context, t-tests are usually understood as post hoc contrasts using the ANOVA estimate of the within-group variance, pooled across all groups (which is not what happens in a two-sample t-test). So there are actually two different versions of my question, and the answer to both of them turns out to be positive. See below.
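To make the distinction in remark 3 concrete, here is a small R sketch with made-up toy data (a hypothetical illustration, not taken from any of the linked threads), showing the two versions side by side:

# Toy data (hypothetical, for illustration only)
set.seed(1)
y <- rnorm(12)
g <- factor(rep(c("a", "b", "c"), each = 4))

# Version 1: an ordinary two-sample t-test, using only the two groups involved
t.test(y[g == "a"], y[g == "b"], var.equal = TRUE)$p.value

# Version 2: post hoc pairwise contrasts using the within-group SD
# pooled across ALL groups (pool.sd = TRUE is the default)
pairwise.t.test(y, g, p.adjust.method = "none")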

amoeba
  • 3
    Your question is covered in many threads: try searching our site on [significant regression](http://stats.stackexchange.com/search?tab=votes&q=regression%20significant). (ANOVA is an application of least squares regression.) For instance, http://stats.stackexchange.com/questions/14500/how-can-a-regression-be-significant-yet-all-predictors-be-non-significant/14528#14528 provides an explicit example and some intuition. Please research these and edit your question, if possible, to distinguish it from previous threads. – whuber Jan 22 '14 at 16:40
  • Thank you, I have not seen that before. However, I have a really hard time translating these explanations about multiple regression into the language of ANOVA comparisons. This is of course my own problem, but I would guess that I am not alone, so maybe an answer to my question would still be useful for the community. Here is my confusion: somebody gave an example of regressing weight on left/right shoe sizes (two strongly correlated IVs) => F signif, t not. Very well. Now in ANOVA regression with 3 groups there are 2 *dummy* IVs; they are dummies => always perfectly anticorrelated... And so what? – amoeba Jan 22 '14 at 17:46
  • I'm afraid I don't follow that last remark. First, the issue is not necessarily related to strong correlation in the design matrix. Second, dummies are *not* "perfectly anticorrelated": if they were, the software would have to drop one of them anyway. You perhaps might be referring to [subtler issues in more complex ANOVA models](http://stats.stackexchange.com/questions/18084/collinearity-between-categorical-variables). – whuber Jan 22 '14 at 19:04
  • @amoeba: your dummy variables are negatively correlated. – Michael M Jan 22 '14 at 19:05
  • Do you have a lot of groups? – Jeremy Miles Jan 22 '14 at 19:55
  • Jeremy, I don't have any groups, my question is purely theoretical: I want to understand if such a situation is possible in principle. I guess I would prefer an example with 3 groups, as this is the smallest number when the question still makes sense. – amoeba Jan 22 '14 at 19:57
  • 4
    I take exception to your "further remark" no. 1. Just because you have highly significant pairwise comparisons and a nonsignificant F does not imply that those significant results are false positives. In order to know for sure that something is a false positive, you have to know that there is no difference in the actual means, the mu's. The F statistic is not sacred. In fact, it's not even mandatory. It's most useful for model selection, but beyond that it's hardly informative of what specifically is going on in your data. – Russ Lenth Jul 26 '14 at 02:45

3 Answers

19

Note: There was something wrong with my original example. I stupidly got caught by R's silent argument recycling. My new example is quite similar to my old one. Hopefully everything is right now.

Here's an example I made in which the ANOVA is significant at the 5% level, yet none of the 6 pairwise comparisons is significant, even at the same 5% level.

Here's the data:

g1:  10.71871  10.42931   9.46897   9.87644
g2:  10.64672   9.71863  10.04724  10.32505  10.22259  10.18082  10.76919  10.65447 
g3:  10.90556  10.94722  10.78947  10.96914  10.37724  10.81035  10.79333   9.94447 
g4:  10.81105  10.58746  10.96241  10.59571


Here's the ANOVA:

             Df Sum Sq Mean Sq F value Pr(>F)  
as.factor(g)  3  1.341  0.4469   3.191 0.0458 *
Residuals    20  2.800  0.1400        

Here are the two-sample t-test p-values (equal-variance assumption):

        g2     g3     g4
 g1   0.4680 0.0543 0.0809 
 g2          0.0550 0.0543 
 g3                 0.8108
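(The code behind these tables isn't shown; the following sketch, reconstructed from the data as listed above, should reproduce the numbers.)

g1 <- c(10.71871, 10.42931,  9.46897,  9.87644)
g2 <- c(10.64672,  9.71863, 10.04724, 10.32505, 10.22259, 10.18082, 10.76919, 10.65447)
g3 <- c(10.90556, 10.94722, 10.78947, 10.96914, 10.37724, 10.81035, 10.79333,  9.94447)
g4 <- c(10.81105, 10.58746, 10.96241, 10.59571)
y <- c(g1, g2, g3, g4)
g <- rep(1:4, c(4, 8, 8, 4))
summary(aov(y ~ as.factor(g)))                       # overall F-test
groups <- list(g1, g2, g3, g4)
for (i in 1:3) for (j in (i + 1):4)                  # two-sample t-tests, equal variances
  cat(sprintf("g%d vs g%d: p = %.4f\n", i, j,
              t.test(groups[[i]], groups[[j]], var.equal = TRUE)$p.value))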

With a little more fiddling with group means or individual points, the difference in significance could be made more striking (in that I could make the first p-value smaller and the lowest of the set of six p-values for the t-test higher).

--

Edit: Here's an additional example that was originally generated with noise about a trend, which shows how much better you can do if you move points around a little:

g1:  7.27374 10.31746 10.54047  9.76779
g2: 10.33672 11.33857 10.53057 11.13335 10.42108  9.97780 10.45676 10.16201
g3: 10.13160 10.79660  9.64026 10.74844 10.51241 11.08612 10.58339 10.86740
g4: 10.88055 13.47504 11.87896 10.11403

The F has a p-value below 3% and none of the t's has a p-value below 8%. (For a 3 group example - but with a somewhat larger p-value on the F - omit the second group)

And here's a really simple, if more artificial, example with 3 groups:

g1: 1.0  2.1
g2: 2.15 2.3 3.0 3.7 3.85
g3: 3.9  5.0

(In this case, the largest variance is on the middle group - but because of the larger sample size there, the standard error of the group mean is still smaller)
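A quick sketch to check this example (the p-values indicated in the comments are approximate):

g1 <- c(1.0, 2.1)
g2 <- c(2.15, 2.3, 3.0, 3.7, 3.85)
g3 <- c(3.9, 5.0)
dat <- data.frame(y = c(g1, g2, g3),
                  grp = factor(rep(c("g1", "g2", "g3"), c(2, 5, 2))))
summary(aov(y ~ grp, dat))                # overall F: p roughly 0.03
t.test(g1, g2, var.equal = TRUE)$p.value  # each pairwise p is above 0.05
t.test(g1, g3, var.equal = TRUE)$p.value
t.test(g2, g3, var.equal = TRUE)$p.value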


Multiple comparisons t-tests

whuber suggested I consider the multiple comparisons case. It proves to be quite interesting.

The case for multiple comparisons (all conducted at the original significance level - i.e. without adjusting alpha for multiple comparisons) is somewhat more difficult to achieve, as playing around with larger and smaller variances, or with more and fewer d.f. in the different groups, doesn't help in the same way as it does with ordinary two-sample t-tests.

However, we do still have the tools of manipulating the number of groups and the significance level; if we choose more groups and smaller significance levels, it again becomes relatively straightforward to identify cases. Here's one:

Take eight groups with $n_i=2$. Define the values in the first four groups to be (2,2.5) and in the last four groups to be (3.5,4), and take $\alpha=0.0025$ (say). Then we have a significant F:

> summary(aov(values~ind,gs2))
            Df Sum Sq Mean Sq F value  Pr(>F)   
ind          7      9   1.286   10.29 0.00191 
Residuals    8      1   0.125                   

Yet the smallest p-value on the pairwise comparisons is not significant at that level:

> with(gs2,pairwise.t.test(values,ind,p.adjust.method="none"))

        Pairwise comparisons using t tests with pooled SD 

data:  values and ind 

   g1     g2     g3     g4     g5     g6     g7    
g2 1.0000 -      -      -      -      -      -     
g3 1.0000 1.0000 -      -      -      -      -     
g4 1.0000 1.0000 1.0000 -      -      -      -     
g5 0.0028 0.0028 0.0028 0.0028 -      -      -     
g6 0.0028 0.0028 0.0028 0.0028 1.0000 -      -     
g7 0.0028 0.0028 0.0028 0.0028 1.0000 1.0000 -     
g8 0.0028 0.0028 0.0028 0.0028 1.0000 1.0000 1.0000

P value adjustment method: none 
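(The construction of gs2 isn't shown above; one way to reconstruct it from the description, as a sketch:)

# Eight groups of size 2; the first four at (2, 2.5), the last four at (3.5, 4)
gs2 <- data.frame(
  values = c(rep(c(2, 2.5), 4), rep(c(3.5, 4), 4)),
  ind    = factor(rep(paste0("g", 1:8), each = 2))
)
summary(aov(values ~ ind, gs2))
with(gs2, pairwise.t.test(values, ind, p.adjust.method = "none"))

With $\alpha = 0.0025$, the F-test (p = 0.0019) rejects while the smallest pairwise p-value (0.0028) does not, matching the output above.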
Glen_b
  • +1. I made some light formatting changes re your data, as I think they are better aligned & easier to read this way. If you don't like it, roll it back w/ my apologies. – gung - Reinstate Monica Jan 23 '14 at 02:43
  • @gung On second thought, with a reduction in number of figures presented, this way looks fine now. Thanks – Glen_b Jan 23 '14 at 02:57
  • It depends on how you have your zoom / font size set in your browser. Oddly, that differently affects the regular text & the code text. On my browser, it didn't require scrolling, but on others' I guess it did. I agree that having to scroll to read code is quite annoying. I think it looks well aligned & readable as it stands now. – gung - Reinstate Monica Jan 23 '14 at 02:58
  • Thank you very much, this is exactly what I was looking for. Now let me see if I got the intuition right. The reason the t-test between g1 and g4 (most extreme groups) fails but ANOVA reports significance, is that the variance of "intermediate" groups (g2 and g3) is lower than the variance of extreme groups (in this case g4 has by far the largest variance). ANOVA's estimate of the "within" (aka "error") variance is a weighted average among all groups, so including intermediate groups makes it smaller than without them, and this can ultimately lead to lower p-value. Correct? – amoeba Jan 23 '14 at 10:17
  • 1
    Actually group 4 has the smallest variance above; group 1 is the largest (not that it matters). However, I created some other examples (not presently in the post) in which one of the middle groups (with mean between the extremes) had the largest variance. However, the extreme groups tend to have larger variance compared to at least one of the middle two groups (driving the p-value of the F), so your basic idea seems to be correct. – Glen_b Jan 23 '14 at 11:27
  • amoeba - I added another example - which has three groups, fewer observations and almost identical variances in each group (but still has more observations in the center than the end groups). – Glen_b Jan 23 '14 at 12:47
  • 2
    The two-sample t-test is not the same thing as pairwise tests in the regression. The distinction lies in the estimates of residual variance. Thus your examples are not really examples of the stronger paradox, which is that *within one and the same regression analysis* the F-test can be significant while none of its pairwise comparisons are significant. I believe this paradox does not necessarily arise from heteroscedasticity, either: it can appear even when all group variances are equal. – whuber Jan 23 '14 at 14:26
  • @whuber I'm not quite sure whether you're telling me I answered a different question to the one being asked (which might be true but I don't read it that way), or simply that the usual pairwise multiple comparisons that are done post-hoc in an anova are different from the corresponding two sample-t (which is certainly the case, no dispute there) and that that comparison is the more interesting/relevant one. – Glen_b Jan 23 '14 at 14:30
  • 1
    Both, I think. I am suggesting it would be more interesting to interpret the question as I have suggested rather than as you have answered (and am implicitly hinting that your solution could be modified to address the more interesting form of the question). – whuber Jan 23 '14 at 14:32
  • @whuber I'll have to look at doing it tomorrow. I have some other stuff to take care of. – Glen_b Jan 23 '14 at 14:34
  • 1
    Glen_b, I think you answered exactly the question I asked. I am not sure I even understand @whuber's interpretation, but I start to suspect that he misunderstood what I meant by "pairwise t-tests". I meant *two-sample t-tests*. If these are two different things, I am sorry for the confusion and would be grateful for further clarifications. I do understand whuber's point in the context of [usual continuous regression](http://stats.stackexchange.com/questions/14500/how-can-a-regression-be-significant-yet-all-predictors-be-non-significant), but I don't get what it means in the ANOVA context. – amoeba Jan 23 '14 at 15:33
  • 4
    More interesting still might be to address when it's _possible_ for the F-test to reject the null but none of the pairwise t-tests to reject it at the same significance level (using the same error variance estimate as the F-test). E.g. for 3 groups with equal sample sizes the union of the 5% rejection region for the pairwise t-tests contains the 5% rejection region for the ANOVAR F-test even when the sample size gets very large. – Scortchi - Reinstate Monica Jan 23 '14 at 15:57
  • 1
    amoeba: Usually you'd do the post-hoc t-tests using the residual mean square error from the ANOVAR to calculate the standard error estimate for the difference of means, rather than calculate it separately from only the two groups involved in each pairwise test. I think that's what @whuber meant. – Scortchi - Reinstate Monica Jan 23 '14 at 16:02
  • 1
    @Scortchi: if I understand you correctly, then I referred to exactly this case in my "remark 2" in the original question (see also links there). My understanding is that post-hoc t-tests basically demand lower p-values to declare significance (it is obvious in the Bonferroni case), so to have significant $F$ and no significant post-hoc $t$ is *easier* than to have significant $F$ and no significant "raw" $t$. At the same time, whuber has just said that he is talking about "stronger paradox", not a weaker one. Hence my confusion. – amoeba Jan 23 '14 at 17:50
  • 4
    @Scortchi: (+1) to your comment. Years ago, I remember working out that it is indeed impossible in the three-group case unless the level of the test is very small (something below $0.005$, if I recall). It comes down to relationships on the ratio of cumulative $F$-distributions with different numerator degrees of freedom, evaluated at a fixed level and fixed denominator degrees of freedom. – cardinal Jan 23 '14 at 17:52
  • 4
    Amoeba, the confusion arises from the fact that "pairwise t-tests" can mean two things. In the ANOVA context, it would usually be understood to mean *post hoc* contrasts using the ANOVA estimates. As others have pointed out, this is not the same as conducting the usual t-test on pairs of groups, because the ANOVA version is based on an estimate of the within-group variance derived from *all* the groups. – whuber Jan 23 '14 at 18:05
  • @whuber: I see. However, let's say we conducted the usual pairwise t-tests (how can I call them to clarify: *a priori*?), and then ANOVA *post hoc* tests. Which p-values will generally be lower? Originally I thought post hocs are less prone to false positives, ergo have higher p-values; that's why a "significant-F-nonsignificant-t paradox" is stronger in the case of a priori tests (unlike what you wrote above). Now I suspect it can go both ways, so the two versions of this "paradox" are equally strong. – amoeba Jan 23 '14 at 21:29
  • 2
    I think you've made a good summary. I referred to the paradox as "stronger" in the sense that when all tests are conducted within the framework of a single ANOVA analysis, one would (naively) expect them to be internally consistent. (When you conduct two sets of tests that are not inherently related, it shouldn't be much of a surprise when they give conflicting results: this happens often.) We have to accept that it is logically consistent and statistically valid to conclude that group means vary significantly while not finding differences between any specific pairs of groups. – whuber Jan 23 '14 at 22:58
  • 1
    @whuber I believe the example now included at the end works for multiple comparisons (as well, of course, for the ordinary two-sample-t case). – Glen_b Jan 24 '14 at 00:53
  • 1
    Thank you. Your example takes exactly the right approach by maximizing between-group variance subject to a constraint on the maximum between-group effect (placing half the groups at one end of the range of group means and the other half at the other end). The same intuition suggests to me that by increasing the number of groups, you can find similar examples for more moderate thresholds of significance. I believe this is just a statistical way of rephrasing @cardinal's characterization of the phenomenon in terms of relationships among $F$ distributions; both points of view provide insight. – whuber Jan 24 '14 at 16:15
4

Summary: I believe that this is possible, but very, very unlikely. The difference will be small, and if it happens, it's because an assumption has been violated (such as homogeneity of variance, i.e. homoscedasticity).

Here's some code that seeks out such a possibility. Note that it increments the seed by 1 on each iteration, so the search through seeds is systematic and any hit can be reproduced from its seed.

stopNow <- FALSE
counter <- 0
while (stopNow == FALSE) {
  counter <- counter + 1
  print(counter)
  set.seed(counter)                     # increment the seed each iteration: systematic search
  x <- rep(c(0:5), 100)                 # 6 groups, 100 observations each
  y <- rnorm(600) + x * 0.01            # small trend across the groups
  df <- as.data.frame(cbind(x, y))
  df$x <- as.factor(df$x)
  fit <- lm(y ~ x, data = df)
  anovaP <- anova(fit)$"Pr(>F)"[[1]]    # p-value of the overall F-test

  # smallest p-value over all pairwise (Welch) two-sample t-tests
  minTtestP <- 1
  for (loop1 in c(0:5)) {
    for (loop2 in c(0:5)) {
      newTtestP <- t.test(df[x == loop1, ]$y, df[x == loop2, ]$y)$p.value
      minTtestP <- min(minTtestP, newTtestP)
    }
  }

  if (minTtestP > 0.05 & anovaP < 0.05) stopNow <- TRUE   # stop: significant F, no significant t
  cat("\nminTtestP = ", minTtestP)
  cat("\nanovaP = ", anovaP)
  cat("\nCounter = ", counter, "\n\n")
}

Searching for a significant overall F (R2) with no significant t-tests, I found nothing up to a seed of 18,000. Searching for a lower p-value from R2 than from the t-tests, I get a result at seed = 323, but the difference is very, very small. It's possible that tweaking the parameters (increasing the number of groups?) might help. The reason that the R2 p-value can be smaller is that when the standard error is calculated for the parameters in the regression, all groups are combined, so the standard error of the difference is potentially smaller than in the t-test.
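To see the point about standard errors, here is a small hypothetical illustration (toy data and an arbitrary seed, not from the search above): the regression coefficient for a group difference uses the residual variance pooled over all groups, while a two-sample t-test estimates the variance from the two groups alone.

set.seed(1)                                    # arbitrary seed, toy example
grp <- factor(rep(1:4, each = 10))
y <- rnorm(40) + 0.5 * (grp == "4")
fit <- lm(y ~ grp)
summary(fit)$coefficients["grp4", ]            # (group 4 - group 1): SE pooled over all groups
t.test(y[grp == "4"], y[grp == "1"], var.equal = TRUE)  # SE from groups 1 and 4 only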

I wondered if violating homoscedasticity might help (as it were). It does. If I use

y <- (rnorm(600) + x * 0.01) * x * 5

to generate y, then I find a suitable result at seed = 1889, where the minimum p-value from the t-tests is 0.061 and the p-value associated with R-squared is 0.046.

If I vary the group sizes (which amplifies the effect of the violation of homoscedasticity), by replacing the x sampling with:

x <- sample(c(0:5), 100, replace=TRUE)

I get a significant result at seed = 531, with the minimum t-test p-value at 0.063 and the p-value for R2 at 0.046.

If I stop correcting for heteroscedasticity in the t-test, by using:

newTtestP <- t.test(df[x==loop1,]$y, df[x==loop2,]$y, var.equal = TRUE)$p.value

My conclusion is that this is very unlikely to occur, and the difference is likely to be very small, unless you have violated the homoscedasticity assumption in regression. Try running your analysis with a robust/sandwich/whatever you want to call it correction.

Jeremy Miles
  • You seem to have an unfinished sentence starting with "If I stop correcting for heteroscedasticity in the t-test". Apart from that, thanks a lot! Please see my update to the question. Also note @whuber's first comment up here; if I understand correctly, he insists that such a situation can easily (?) happen (and calls it "well known"). Maybe there is some misunderstanding here, but what is it? – amoeba Jan 22 '14 at 23:19
  • I think @whuber is talking about non-significant parameters in the model, not non-significant t-tests. – Jeremy Miles Jan 22 '14 at 23:29
  • No, he's not. If it's well known, I don't know it and I've tried to come up with an example, and can't. – Jeremy Miles Jan 22 '14 at 23:56
  • 1
    I am glad, then, that @Glen_b produced a simple example. The intuition is that the overall test assesses whether there is evidence that the spread in the group means cannot reasonably be explained by residual variance alone. The pairwise tests, involving only two means at a time, have to be considerably more conservative in evaluating the same evidence. Therefore even comparing the two extreme group means may fail to uncover a significant difference when the overall distribution of *all* means is significant. This sometimes occurs in practice, especially with large numbers of groups. – whuber Jan 23 '14 at 05:04
  • 3
    BTW, the reason for calling this "well known" stems from my recollection of being warned about it in the Systat software manual c. 1989. It was a *very* instructive manual (most of it written personally by [Leland Wilkinson](http://www.cs.uic.edu/~wilkinson/), the developer) and probably still is. The manual is online, but you have to register on the Systat site to be able to download it. – whuber Jan 23 '14 at 05:08
  • @whuber: one thing I still don't understand is the connection between Glen_b's example (that fully answers my original question) and your linked above [discussion](http://stats.stackexchange.com/questions/14500/how-can-a-regression-be-significant-yet-all-predictors-be-non-significant) of significant F and non-significant t in linear regression due to multicollinearity... Would be very grateful if you could elaborate or provide some hints. – amoeba Jan 23 '14 at 10:45
2

It's entirely possible:

  • One or more pairwise t-tests is significant but the overall F-test isn't
  • The overall F-test is significant but none of the pairwise t-tests is

The overall F test tests all contrasts simultaneously. As such, it must be less sensitive (it has less statistical power) for individual contrasts (e.g., a pairwise comparison). The two tests are closely related to each other, but they are not reporting exactly the same thing.

As you can see, the textbook recommendation of not doing planned comparisons unless the overall F-test is significant is not always correct. In fact, the recommendation may prevent us from finding significant differences because the overall F test has less power than planned comparisons for testing the specific differences.
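As a rough illustration of both bullets, here is a hypothetical simulation sketch (not from the original answer; the number of groups, effect size, and seed are arbitrary) that tallies how often the overall F-test and the most significant unadjusted pairwise t-test disagree at the 5% level:

set.seed(123)                                   # arbitrary seed
k <- 6; n <- 10; nsim <- 2000
res <- replicate(nsim, {
  grp <- factor(rep(1:k, each = n))
  y <- rnorm(k * n) + 0.3 * (grp == "1")        # small shift in one group only
  pF <- anova(lm(y ~ grp))$"Pr(>F)"[1]          # overall F-test p-value
  pT <- min(pairwise.t.test(y, grp,             # smallest unadjusted pairwise p-value
                            p.adjust.method = "none")$p.value, na.rm = TRUE)
  c(F_sig = pF < 0.05, t_sig = pT < 0.05)
})
table(F_sig = res["F_sig", ], t_sig = res["t_sig", ])

The off-diagonal cells of the table count the simulated datasets where the two tests disagree, in either direction.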

SmallChess
  • I am not sure I follow the logic of your answer. Are you saying that rejection of H0 by an F-test implies that there is at least one non-zero contrast, but this contrast might not correspond to any of the pairwise comparisons? If so, does this mean that if an F-test rejects H0, then at least one of the pairwise tests *across all possible contrasts* will lead to a rejection too? – amoeba Sep 03 '15 at 10:57
  • @amoeba I've edited my answer. – SmallChess Nov 07 '15 at 02:29