
Assume the model:

$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon$$

where $x_1$ is a continuous nuisance variable and $x_2$ is a binary group indicator (0/1). I would like to estimate the effect size (Cohen's $d$) of $x_2$ on the outcome $Y$, while accounting for $x_1$. Is it valid to compute Cohen's $d$ by:

  1. regressing out the effect of $x_1$ from $Y$:

    $$Y = \beta_0 + \beta_1x_1 + \varepsilon$$

  2. and using the residuals from step 1 as outcome and $x_2$ as predictor to calculate the effect size for this new outcome measure?

If this is not a valid way of estimating the effect size of $x_2$ in a regression analysis, please explain why not.

gung - Reinstate Monica
Vincent

1 Answer


It's not a valid way to do it. Among other things, $x_1$ and $x_2$ can be correlated, and when they are, regressing out $x_1$ first also removes part of the effect of $x_2$, so the residualized estimate is biased toward zero. Here is a simple simulation (coded in R):

set.seed(9684)                                   # makes this perfectly reproducible
x1 = c(rnorm(20), rnorm(20, mean=1))
x2 = rep(0:1, each=20)
cor(x1, x2)                                      # [1] 0.4715828  these are correlated
out.mat = matrix(NA, ncol=3, nrow=10000)
colnames(out.mat) = c("ignore x1", "regress out x1", "control for x1")
for(i in 1:10000){
  y  = 5 + 3*x1 +.5*x2 + rnorm(40, mean=0, sd=1) # the true d is .5
  out.mat[i,1] = (mean(y[21:40])-mean(y[1:20]))/sd(y)
  r = resid(lm(y~x1))
  mr = lm(r~x2)
  out.mat[i,2] = coef(mr)[2]/summary(mr)$sigma
  m2 = lm(y~x1+x2)
  out.mat[i,3] = coef(m2)[3]/summary(m2)$sigma
}
t(apply(out.mat, 2, summary))  # only the estimate from mult reg is unbiased
#                      Min.   1st Qu.    Median      Mean   3rd Qu.     Max.
# ignore x1       0.5409884 0.9492157 1.0073437 1.0063739 1.0646372 1.283686
# regress out x1 -0.8305609 0.2054523 0.3977148 0.4004736 0.5911994 1.473212
# control for x1 -1.0824200 0.2611255 0.5077147 0.5162029 0.7602736 2.043803
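As an aside (this is a sketch I'm adding, not part of the simulation above): the reason step 2 fails is that only $Y$ was residualized. By the Frisch–Waugh–Lovell theorem, if you residualize *both* $Y$ and $x_2$ on $x_1$, the simple-regression slope on the residuals exactly reproduces the multiple-regression coefficient of $x_2$:

```r
# Frisch-Waugh-Lovell: residualize BOTH y and x2 on x1, then regress
# the y-residuals on the x2-residuals. The slope equals the x2
# coefficient from the full multiple regression exactly.
set.seed(1)
x1 <- c(rnorm(20), rnorm(20, mean=1))
x2 <- rep(0:1, each=20)
y  <- 5 + 3*x1 + .5*x2 + rnorm(40)

ry <- resid(lm(y  ~ x1))   # y with x1 regressed out
rx <- resid(lm(x2 ~ x1))   # x2 with x1 regressed out
coef(lm(ry ~ rx))[2]       # identical to the line below
coef(lm(y ~ x1 + x2))[3]
```

Note that this recovers the regression coefficient, not Cohen's $d$ itself; to standardize it you would still divide by the residual standard deviation from the full model, as in the simulation above.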

It may help you to read my answer to "Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?" You might also want to look at how @whuber uses a series of simple linear regressions to match multiple regression here: "How can adding a 2nd IV make the 1st IV significant?"

gung - Reinstate Monica