12

What explains the differences in p-values in the following `aov` and `lm` calls? Is the difference only due to different types of sums-of-squares calculations?

set.seed(10)
data=rnorm(12)
f1=rep(c(1,2),6)
f2=c(rep(1,6),rep(2,6))
summary(aov(data~f1*f2))
summary(lm(data~f1*f2))$coeff
amoeba
Remi.b

2 Answers

13

`summary(aov)` uses so-called Type I (sequential) sums of squares. `summary(lm)` uses so-called Type III sums of squares, which are not sequential. See gung's answer for details.
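The "sequential" part of Type I sums of squares can be made concrete. In the question's balanced design the two predictors happen to be orthogonal, so as an illustration (an assumption added here, not part of the question) we can drop the last observation to unbalance the design; the Type I sum of squares for `f1` then depends on whether it is entered before or after `f2`:

```r
set.seed(10)
data <- rnorm(12)
f1 <- rep(c(1, 2), 6)
f2 <- c(rep(1, 6), rep(2, 6))

# Hypothetically drop the last observation so the design is unbalanced;
# the predictors are then no longer orthogonal
d  <- data[1:11]
g1 <- f1[1:11]
g2 <- f2[1:11]

# The Type I (sequential) SS for g1 changes with the order of entry:
anova(lm(d ~ g1 + g2))["g1", "Sum Sq"]  # g1 entered first
anova(lm(d ~ g2 + g1))["g1", "Sum Sq"]  # g1 entered last
```

With the full balanced data the two orderings would give identical sums of squares, which is why order does not matter in the question's `aov` table.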


Note that you need to call `lm(data ~ factor(f1) * factor(f2))` (`aov()` automatically converts the RHS of the formula to factors). Then note the denominator of the general $t$-statistic in linear regression (see this answer for further explanations):

$$t = \frac{\hat{\psi} - \psi_{0}}{\hat{\sigma} \sqrt{\bf{c}' (\bf{X}'\bf{X})^{-1} \bf{c}}}$$

$\bf{c}' (\bf{X}'\bf{X})^{-1} \bf{c} $ differs for each tested $\beta$ coefficient because the vector $\bf{c}$ changes. In contrast, the denominator in the ANOVA $F$-test is always MSE.
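As a check of the formula, one can compute $\hat{\sigma} \sqrt{\bf{c}' (\bf{X}'\bf{X})^{-1} \bf{c}}$ by hand on the question's data, taking each $\bf{c}$ to be a unit vector picking out one coefficient, and compare with the standard errors that `summary(lm(...))` prints:

```r
set.seed(10)
data <- rnorm(12)
f1 <- rep(c(1, 2), 6)
f2 <- c(rep(1, 6), rep(2, 6))

fit <- lm(data ~ f1 * f2)
X         <- model.matrix(fit)     # 12 x 4 design matrix
XtX.inv   <- solve(crossprod(X))   # (X'X)^{-1}
sigma.hat <- summary(fit)$sigma    # sqrt(MSE), the same for every test

# With c the unit vector e_j, c'(X'X)^{-1}c is the j-th diagonal element,
# so the standard errors are sigma.hat * sqrt(diag((X'X)^{-1}))
se.manual <- sigma.hat * sqrt(diag(XtX.inv))
cbind(manual = se.manual, from.summary = summary(fit)$coeff[, "Std. Error"])
```

The two columns agree: $\hat{\sigma}$ (the square root of the MSE) is common to all four tests, but the $\sqrt{\bf{c}' (\bf{X}'\bf{X})^{-1} \bf{c}}$ factor differs per coefficient.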

amoeba
caracal
  • I think the first sentence of this answer is wrong. The difference seems to be *precisely* due to different types of sums of squares: namely, Type I vs. Type II/III. Type I is sequential, which is what `aov` reports, whereas Type II/III is not. This is explained in quite some detail in @gung's answer that you linked to. – amoeba Apr 17 '17 at 21:12
  • @amoeba What do you suggest to correct the answer? – caracal Apr 24 '17 at 07:39
  • I edited the first paragraph, see if you are okay with the edit, and feel free to change it as you like. – amoeba Apr 24 '17 at 11:04
2
set.seed(10)
data=rnorm(12)
f1=rep(c(1,2),6)
f2=c(rep(1,6),rep(2,6))
summary(aov(data~f1*f2))
            Df Sum Sq Mean Sq F value Pr(>F)
f1           1  0.535  0.5347   0.597  0.462
f2           1  0.002  0.0018   0.002  0.966
f1:f2        1  0.121  0.1208   0.135  0.723
Residuals    8  7.169  0.8962               
summary(lm(data~f1*f2))$coeff
               Estimate Std. Error    t value  Pr(>|t|)
(Intercept)  0.05222024   2.732756  0.0191090 0.9852221
f1          -0.17992329   1.728346 -0.1041014 0.9196514
f2          -0.62637109   1.728346 -0.3624106 0.7264325
f1:f2        0.40139439   1.093102  0.3672066 0.7229887

These are two different calls. From the `lm` model you get the coefficients, while from the `aov` model you are just tabulating the sources of variation. Try the code

anova(lm(data~f1*f2))
Analysis of Variance Table

Response: data
          Df Sum Sq Mean Sq F value Pr(>F)
f1         1 0.5347 0.53468  0.5966 0.4621
f2         1 0.0018 0.00177  0.0020 0.9657
f1:f2      1 0.1208 0.12084  0.1348 0.7230
Residuals  8 7.1692 0.89615   

This gives the tabulation of the sources of variation, leading to the same results as `summary(aov(...))`.
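Note, in passing, why the `f1:f2` line agrees between the two summaries in the question while the `f1` and `f2` lines do not: for the *last* term entered into the formula, the sequential (Type I) test coincides with the marginal test, and for a 1-df term $F = t^2$, so the p-values match. A quick check on the question's data:

```r
set.seed(10)
data <- rnorm(12)
f1 <- rep(c(1, 2), 6)
f2 <- c(rep(1, 6), rep(2, 6))
fit <- lm(data ~ f1 * f2)

tval <- summary(fit)$coeff["f1:f2", "t value"]
Fval <- anova(fit)["f1:f2", "F value"]
all.equal(tval^2, Fval)   # TRUE: F = t^2 for the last 1-df term

p.t <- summary(fit)$coeff["f1:f2", "Pr(>|t|)"]
p.F <- anova(fit)["f1:f2", "Pr(>F)"]
all.equal(p.t, p.F)       # TRUE: same p-value (0.723 in both tables)
```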

user157663
    This does not appear to answer the question, which asks why the p-values for `f1` and `f2` differ in the two summaries of your top panel. It looks like you are only showing that `summary(aov(...))` and `anova(lm(...))` in `R` have similar output. – whuber Apr 17 '17 at 21:26