
While ANCOVA is often misused, I'd like to provide users of ezANOVA() (from the ez package for R) with the ability to specify covariates. Since ezANOVA() uses a combination of stats::aov() and car::Anova(), I think the simplest way to achieve ANCOVA within ezANOVA() is to compute two ANOVAs: a first that includes only the covariates as predictors, from which I would obtain residuals, and a second that uses those residuals as the dependent variable with the regular non-covariate predictors (a rough sketch is below). Does anything look inherently wrong with this approach?
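
Concretely, something like this, where dat, dv, cv, and iv are placeholder names rather than actual ezANOVA() arguments:

# Proposed two-stage approach (rough sketch; dat has columns dv, cv, iv)
stage1 = stats::aov(dv ~ cv, data = dat)          # covariate-only model
dat$dv_resid = residuals(stage1)                  # residualized response
stage2 = stats::aov(dv_resid ~ iv, data = dat)    # 'ANCOVA' on the residuals
summary(stage2)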

(P.S. I thought I'd also include some evaluations of ANCOVA's assumptions, like checking that the covariates don't correlate with any of the non-covariate predictors, and checking that the covariates' effects on the DV don't interact with those of the non-covariate predictors.)
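
For those checks, I'm imagining something along these lines (again with placeholder names, and assuming iv is a two-level factor for the correlation test):

# Check 1: does the covariate correlate with the factor?
cor.test(dat$cv, as.numeric(as.factor(dat$iv)))
# Check 2: homogeneity of regression slopes; a significant cv:iv term
# means the covariate's effect on the DV differs across groups
summary(stats::aov(dv ~ cv * iv, data = dat))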

Mike Lawrence

1 Answer


This question is very old, so you probably figured out long ago that this isn't going to work, but I thought I'd answer anyway, so that it doesn't remain officially 'unanswered'.

This won't work. So long as there is any correlation between the covariate and the factor, the effect estimates and the $p$-values from the two-stage residualization approach will differ from those of a proper ANCOVA fit as a single model. Consider:

set.seed(6523)
x = rep(c("T", "C"), each=20)
c = sort(rnorm(40))            # the `sort()` makes these highly correlated, r = -.78
y = 15 + ifelse(x=="T",2,0) - .5*c + rnorm(40)

mod1  = lm(y~x+c)              # proper ANCOVA: factor and covariate fit together
mod2a = lm(y~c)                # stage 1 of the proposal: covariate-only model
mod2b = lm(resid(mod2a)~x)     # stage 2: regress those residuals on the factor
summary(mod1)
# ...
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  15.0422     0.2574  58.440   <2e-16 ***
# xT            1.7139     0.4962   3.454   0.0014 ** 
# c            -0.5832     0.2629  -2.218   0.0328 *  
# ...
summary(mod2b)
# ...
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)  
# (Intercept)  -0.3341     0.2365  -1.413   0.1659  
# xT            0.6683     0.3345   1.998   0.0529 .
# ...

Now, I recognize that you say you will check "that the covariates don't correlate with any of the non-covariate predictors", but this still won't work. You need the factors and the covariates to be perfectly uncorrelated, not merely not 'significantly' correlated, and uncorrelated in your sample, not just in the population. Consider:

set.seed(6523)
x = rep(c("T", "C"), each=20)
c = rnorm(40)                  # `x` & `c` are perfectly uncorrelated in the population
y = 15 + ifelse(x=="T",2,0) - .5*c + rnorm(40)
cor.test(as.numeric(x=="T"),c) # point-biserial correlation between factor & covariate
# ...
# data:  as.numeric(x == "T") and c
# t = -0.8869, df = 38, p-value = 0.3807
# ...
#        cor 
# -0.1424126 

This time the correlation is small and non-significant, but the estimates and $p$-values still differ (a little):

mod1  = lm(y~x+c)              # same models as before, refit on the new data
mod2a = lm(y~c)
mod2b = lm(resid(mod2a)~x)
summary(mod1)
# ...
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  14.9955     0.2197  68.262  < 2e-16 ***
# xT            1.8245     0.3132   5.826 1.08e-06 ***
# c            -0.5446     0.1659  -3.282  0.00225 ** 
# ...
summary(mod2b)
# ...
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  -0.8938     0.2183  -4.095 0.000213 ***
# xT            1.7875     0.3087   5.791  1.1e-06 ***
# ...

The closer the sample correlation gets to exactly $r = 0$, the closer the two sets of values will be. In an experimental setup, the factors may often be perfectly uncorrelated with each other, but a measured covariate will rarely be perfectly uncorrelated with them in the sample. That makes this strategy cumbersome and ineffective.
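
To see the role of exact sample orthogonality, here is a minimal sketch (same setup as above, but with c constructed so its sample correlation with the factor is exactly zero). The point estimates for the factor then coincide, though the standard errors and $p$-values still differ slightly, because the residualized regression acts as if it had one more error degree of freedom than it really does:

set.seed(6523)
x = rep(c("T", "C"), each=20)
d = as.numeric(x=="T")
c = resid(lm(rnorm(40) ~ d))   # sample correlation between `c` and the factor is exactly 0
y = 15 + ifelse(x=="T",2,0) - .5*c + rnorm(40)

mod1  = lm(y~x+c)
mod2b = lm(resid(lm(y~c))~x)
coef(summary(mod1))["xT",]     # same point estimate as below...
coef(summary(mod2b))["xT",]    # ...but slightly different SE, t, and p (wrong error df)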

For more related information, it may help you to read @whuber's answer here: How can adding a 2nd IV make the 1st IV significant? The key portion is:

Recall that multiple regression of $Y$ against $X_1$ and $X_2$ is equivalent to

  1. Separately regress $Y$ and $X_1$ against $X_2$.

  2. Regress the $Y$ residuals against the $X_1$ residuals.
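
As a quick sketch of that equivalence (with made-up data; the names x1, x2, and y are arbitrary), the slope on $X_1$ from the one-step multiple regression and from the residual-on-residual regression match exactly:

set.seed(1)
n  = 50
x2 = rnorm(n)
x1 = .6*x2 + rnorm(n)            # x1 correlated with x2
y  = 1 + 2*x1 - 1.5*x2 + rnorm(n)

full = lm(y ~ x1 + x2)           # one-step multiple regression
ry   = resid(lm(y  ~ x2))        # step 1: residualize y  on x2
rx1  = resid(lm(x1 ~ x2))        # step 1: residualize x1 on x2
part = lm(ry ~ rx1)              # step 2: residuals on residuals

coef(full)["x1"]                 # these two slopes are identical
coef(part)["rx1"]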

gung - Reinstate Monica