
After conducting a one-way ANOVA, the F-test indicates that I should retain the null hypothesis (no two means differ), but a t-test (in line with the theory) tells me that the groups I tested are significantly different from each other. Why does one test say the difference is significant while the other says it is not?

shay
  • When using a t-test after an ANOVA, it's recommended to make a correction to the p-value, such as the Bonferroni correction. Have you done that? See a [related question](https://stats.stackexchange.com/q/83030). – Ertxiem - reinstate Monica Oct 31 '19 at 03:23
  • Since the omnibus test was not significant, I decided not to conduct any post hoc analyses. However, I did plan that one comparison before I ran the ANOVA: between the two groups. I made sure to use a corrected t-test in case assumptions were violated. Thus, it is not biased/inflated. – shay Oct 31 '19 at 03:29
  • The $F$-test has the null hypothesis $\text{H}_0: \mu_1 = \mu_2 = \dots = \mu_k$. It does not look for pairwise differences, but rather compares the group means against the grand mean. This is a different null, so of course there are cases where one test is significant and the other is not. On a different note, if you had planned a single comparison, why bother with the ANOVA? – Frans Rodenburg Oct 31 '19 at 06:37
  • @FransRodenburg: Emphasizing your last sentence: "[I]f you had planned a single comparison, why bother with the ANOVA?" – BruceET Oct 31 '19 at 21:46

2 Answers


Suppose the F-test in a one-way ANOVA is not significant, and you then do an (ill-advised) ad hoc t test on the two levels of the factor with the greatest difference in sample means. There is a perhaps surprisingly high probability that this t test will show a "significant" result.

Example: In the following simulation there are five levels of the factor, each with 20 observations, all with equal population means (10) and equal population standard deviations (5). So the truth is that there are no real differences among levels. As expected, the F-test at the 5% level rejects about 5% of the time. [Welch tests, which do not assume equal variances, are used both for the ANOVA and for the t tests below.]

However, there are ${5 \choose 2} = 10$ pairs of levels that might be compared. The comparison most likely to show an ad hoc "significant" difference arises if we 'cherry-pick' the pair with the greatest difference in sample means. If we do that, we will reject in about 24% of the cases where the F test failed to reject, and in about 28% of cases overall, regardless of the outcome of the F test. [It is all too easy to rationalize (that is, to lie to oneself) that one tantalizing difference is somehow 'key' and deserves to be examined on its own.]

Simulation in R with 10,000 iterations of the ANOVA, each with data sampled as specified above.

set.seed(2019)
m = 10^4;  p.f = p.tx = numeric(m)  # p-values of F test and cherry-picked t test
for(i in 1:m) {
 x1 = rnorm(20, 10, 5);  x2 = rnorm(20, 10, 5)  # five groups of 20 obs,
 x3 = rnorm(20, 10, 5);  x4 = rnorm(20, 10, 5)  # all sampled from NORM(10, 5)
 x5 = rnorm(20, 10, 5)
 x = c(x1, x2, x3, x4, x5)
 g = as.factor(rep(1:5, each=20))
 p.f[i] = oneway.test(x ~ g)$p.val              # Welch one-way ANOVA
 MAT = rbind(x1, x2, x3, x4, x5)
 a = apply(MAT, 1, mean)                        # the five group sample means
 mx = which(a==max(a));  mn = which(a==min(a))  # groups with extreme means
 p.tx[i] = t.test(MAT[mx,], MAT[mn,])$p.val     # Welch t test on the extreme pair
}

mean(p.f<.05);  mean(p.tx<.05)
[1] 0.0503
[1] 0.2755
mean(p.tx[p.f > .05] < .05)
[1] 0.2373381

If you had planned to do five comparisons among the means, a Bonferroni correction to avoid 'false discovery' would require testing each comparison at the 1% level. With that correction, you would have a suitably low probability (around 0.03) of 'discovering' a difference between the levels with the lowest and highest sample means, even in the cases where the F test failed to reject:

mean(p.tx[p.f > .05] < .01)
[1] 0.0293777

There are two important steps to prevent false discovery:

(a) If the F-test is not significant, do not even look for differences among pairs of levels.

(b) If the F-test is significant, use some method (such as the Bonferroni correction) to account for multiple post hoc tests; a minimal sketch of this workflow follows.
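
In R, that two-step workflow might look like the following sketch. This is not from the original answer: it reuses hypothetical response and group objects x and g like those in the simulation, and pairwise.t.test with pool.sd=FALSE gives Welch-type comparisons with Bonferroni-adjusted p-values.

fit = oneway.test(x ~ g)    # step (a): Welch omnibus F test first
if (fit$p.value < 0.05) {
  # step (b): only after the omnibus test rejects, do all pairwise
  # Welch t tests, with Bonferroni-adjusted p-values
  pairwise.t.test(x, g, p.adjust.method="bonferroni", pool.sd=FALSE)
}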

Addendum: It happens that the last sample in the simulation above had a non-significant F-test (P-value 0.1392) and a "significant" t test between the groups with the smallest and largest means (P-value 0.01835):

oneway.test(x~g)

        One-way analysis of means 
     (not assuming equal variances)

data:  x and g
F = 1.8278, num df = 4.000, denom df = 47.141,  
  p-value = 0.1392

DTA = matrix(x, byrow=T, nrow=5)  # rows are the five groups
rowMeans(DTA)                     # group 1 has the largest mean, group 3 the smallest
[1] 11.138243 10.463459  7.736375  8.795445  8.516991

t.test(x1, x3)

        Welch Two Sample t-test

data:  x1 and x3
t = 2.4647, df = 37.999, p-value = 0.01835
alternative hypothesis: 
  true difference in means is not equal to 0
95 percent confidence interval:
 0.6077037 6.1960313
sample estimates:
mean of x mean of y 
11.138243  7.736375 
BruceET

I will try to give some intuition as to why this might happen:

Assume you take random samples from each of 10 groups and you think that all true group means are equal.

Now you want to construct a test to check this hypothesis using the observed mean values. You want the test to perform such that, if the hypothesis is correct, it wrongly concludes that the groups have different mean values in only 1 out of 20 cases.

Such a test has to allow a considerable deviation between the group means, because it "knows" that if it looks at many groups at once, some groups will differ from each other just by chance. If the test did not "consider" this fact, it would wrongly conclude a difference more often than 1/20 of the time.

Now repeat the above considerations, but design the test just for the comparison of two groups. This test "knows" that when looking at only two groups, large chance differences are not that likely, and therefore it does not need to allow such large observed differences between the groups in order to hold the 1/20 error rate.

Now, if you look at two groups where you observed a difference, it is possible that the difference is too large for the t-test to retain the null hypothesis, but still looks plausible to the F-test, which "knows" that the more groups are observed, the more likely it is to find deviations from the overall mean, and accounts for that.
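
To put rough numbers on this intuition, here is a small R sketch (not from this answer; it assumes groups of 20 observations, all drawn from the same standard normal population) showing how the largest chance gap between sample means grows with the number of groups, which is exactly the chance variation the F-test must allow for:

set.seed(1)
n = 20                                       # observations per group
max.gap = function(k) {                      # average largest gap among k sample means
  mean(replicate(5000, {
    m = colMeans(matrix(rnorm(k*n), n, k))   # k group means under the null
    max(m) - min(m)
  }))
}
max.gap(2);  max.gap(10)   # the extreme gap is roughly 2 to 3 times larger with 10 groups

A t-test calibrated for only two groups treats a gap of the size typical among ten groups as strong evidence, while the F-test, which expects such gaps under the null, does not.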

ghlavin