
I have always thought that anova(myFit) performs a submodel test for each regressor in the model (i.e. it compares the model with all regressors against the model without the tested regressor). However, this is clearly not the case.

Toy example:

set.seed(0)
n = 20
data = data.frame(
  Y = rbinom(n,20,0.5),
  X1 = sample(LETTERS[1:3], n, TRUE),
  X2 = rbinom(n,6,0.5)
)

anova(lm(Y ~ X1 + X2, data = data))
# Analysis of Variance Table
# Response: Y
#           Df Sum Sq Mean Sq F value Pr(>F)
# X1         2  1.883  0.9413  0.1677 0.8471
# X2         1  3.265  3.2648  0.5817 0.4568
# Residuals 16 89.803  5.6127

anova(lm(Y ~ X2, data = data), lm(Y ~ X1 + X2, data = data))
# Analysis of Variance Table
# Model 1: Y ~ X2
# Model 2: Y ~ X1 + X2
#   Res.Df    RSS Df Sum of Sq      F Pr(>F)
# 1     18 92.908
# 2     16 89.803  2    3.1057 0.2767 0.7619

In the second case, the change in RSS between the two models is tested, giving F(2,16) = 0.2767. What test is performed in the first case?

Karolis Koncevičius
Daniel Dostal

1 Answer


R has comprehensive documentation. For this specific case, help(anova.lm) says:

Details:

Specifying a single object gives a sequential analysis of variance table for that fit. That is, the reductions in the residual sum of squares as each term of the formula is added in turn are given in as the rows of a table, plus the residual sum of squares.

The table will contain F statistics (and P values) comparing the mean square for the row to the residual mean square.

If more than one object is specified, the table has a row for the residual degrees of freedom and sum of squares for each model. For all but the first model, the change in degrees of freedom and sum of squares is also given. (This only make statistical sense if the models are nested.) It is conventional to list the models from smallest to largest, but this is up to the user.

Optionally the table can include test statistics. Normally the F statistic is most appropriate, which compares the mean square for a row to the residual sum of squares for the largest model considered. If ‘scale’ is specified chi-squared tests can be used. Mallows' Cp statistic is the residual sum of squares plus twice the estimate of sigma^2 times the residual degrees of freedom.

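So, in the first case of your example, a term is not tested by dropping it from the full model. Instead, the terms are added to the model one at a time, in the order they appear in the formula. Each row tests the reduction in RSS from adding that term to the terms listed above it, against the residual mean square of the full model. Hence the order matters, e.g.: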

> anova(lm(Y ~ X1 + X2, data = data))
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value Pr(>F)
X1         2  1.883  0.9413  0.1677 0.8471
X2         1  3.265  3.2648  0.5817 0.4568
Residuals 16 89.803  5.6127


> anova(lm(Y ~ X2 + X1, data = data))
Analysis of Variance Table
Response: Y
          Df Sum Sq Mean Sq F value Pr(>F)
X2         1  2.042  2.0417  0.3638 0.5549
X1         2  3.106  1.5528  0.2767 0.7619
Residuals 16 89.803  5.6127
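The X1 and X2 rows of a sequential table can be reproduced by hand from the residual sums of squares of the nested fits. The following is only a sketch of that arithmetic: it rebuilds the toy data from the question, and the names `rss`, `fit0`, `fit1`, `fit_full`, `f_manual` and `f_table` are made up for illustration. The exact numbers depend on the R version's random number generator, so no output is shown.

```r
# Rebuild the toy data from the question
set.seed(0)
n <- 20
data <- data.frame(
  Y  = rbinom(n, 20, 0.5),
  X1 = sample(LETTERS[1:3], n, TRUE),
  X2 = rbinom(n, 6, 0.5)
)

# Nested fits, adding terms in formula order
fit0     <- lm(Y ~ 1,       data = data)
fit1     <- lm(Y ~ X1,      data = data)
fit_full <- lm(Y ~ X1 + X2, data = data)

# Helper (hypothetical name): residual sum of squares of a fit
rss <- function(fit) sum(residuals(fit)^2)

# Reduction in RSS and in residual df as each term is added in turn
ss_X1 <- rss(fit0) - rss(fit1)
ss_X2 <- rss(fit1) - rss(fit_full)
df_X1 <- df.residual(fit0) - df.residual(fit1)
df_X2 <- df.residual(fit1) - df.residual(fit_full)

# Denominator: residual mean square of the full model
ms_res <- rss(fit_full) / df.residual(fit_full)

# Sequential F statistics computed by hand
f_manual <- c(X1 = (ss_X1 / df_X1) / ms_res,
              X2 = (ss_X2 / df_X2) / ms_res)

# The same statistics as reported by anova()
f_table <- anova(fit_full)[c("X1", "X2"), "F value"]

print(f_manual)
print(f_table)
```

Both printed vectors should agree, which shows that every row uses the full model's residual mean square as its denominator, not that of the intermediate model.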

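In the second case, each model in the sequence is compared with the one before it, with the residual mean square of the largest model as the F denominator. Passing the nested sequence explicitly reproduces the first table: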

> anova(lm(Y ~ 1, data = data), lm(Y ~ X1, data = data), lm(Y ~ X1 + X2, data = data))
Analysis of Variance Table
Model 1: Y ~ 1
Model 2: Y ~ X1
Model 3: Y ~ X1 + X2
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     19 94.950
2     17 93.067  2    1.8825 0.1677 0.8471
3     16 89.803  1    3.2648 0.5817 0.4568
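If what you want is the test you originally expected (each term dropped in turn from the full model), one option in base R is drop1() with test = "F". This is a minimal sketch on the same toy data, not the only way to get such tests. The names `fit_full`, `marginal`, `f_drop_X1` and `f_pair_X1` are illustrative, and the printed values depend on the R version's random number generator.

```r
# Rebuild the toy data from the question
set.seed(0)
n <- 20
data <- data.frame(
  Y  = rbinom(n, 20, 0.5),
  X1 = sample(LETTERS[1:3], n, TRUE),
  X2 = rbinom(n, 6, 0.5)
)

fit_full <- lm(Y ~ X1 + X2, data = data)

# Marginal tests: each term is dropped from the full model in turn
marginal <- drop1(fit_full, test = "F")
print(marginal)

# The X1 row matches the explicit two-model comparison from the question
f_drop_X1 <- marginal["X1", "F value"]
f_pair_X1 <- anova(lm(Y ~ X2, data = data), fit_full)[2, "F"]

# The last term of the sequential table is already a marginal test
f_drop_X2 <- marginal["X2", "F value"]
f_seq_X2  <- anova(fit_full)["X2", "F value"]
```

Here the result does not depend on the order of the terms in the formula, because every term is tested after all the others.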
Karolis Koncevičius
  • Thank you for the clear explanation! – Daniel Dostal Mar 17 '20 at 10:50
  • Also note that, for the one-model case, people often don't want Type I (sequential) sums of squares. It's kind of a quirk with R, relative to some other software packages, that the default analysis is Type I SS. For general linear models (*lm()*), the usual go-to function is *car::Anova*. – Sal Mangiafico Mar 17 '20 at 11:10
  • [This answer](https://stats.stackexchange.com/a/20455/28500) provides a superb introduction to the different Types of ANOVA. – EdM Mar 17 '20 at 13:48