Under some strict assumptions, regressing the outcome on the treatment and covariates does indeed control confounding by those covariates; see Schafer & Kang (2008) for more details. This approach is indeed called ANCOVA. It works because the coefficient on treatment is interpreted as the effect of treatment holding the other variables constant, and holding those variables constant removes their confounding influence.
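To make this concrete, here is a minimal sketch in Python with statsmodels; the data-generating process and variable names are hypothetical, purely for illustration. A confounder x affects both the treatment a and the outcome y, so the unadjusted regression is biased, while adjusting for x recovers the treatment effect (under the assumptions discussed below).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
x = rng.normal(size=n)                          # hypothetical confounder
a = (x + rng.normal(size=n) > 0).astype(float)  # treatment depends on x
y = 2.0 * a + 3.0 * x + rng.normal(size=n)      # true treatment effect = 2

# Unadjusted regression of y on a alone is confounded by x
naive = sm.OLS(y, sm.add_constant(a)).fit()

# Adjusting for x (ANCOVA-style) recovers the effect, under the assumptions above
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([a, x]))).fit()
print(naive.params[1], adjusted.params[1])  # biased estimate vs. roughly 2.0
```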
The assumptions required are very strict, though. First, you must assume the covariates are sufficient to remove confounding and do not themselves induce confounding; you need a causal theory to justify this, and it isn't empirically verifiable. Second, the effect of the covariates on the outcome must be exactly as modeled; any nonlinear relationships or interactions must be accounted for. Third, there must be no moderation of the treatment effect by the covariates, which is extremely unlikely to hold. You can still estimate an average treatment effect in the presence of moderation by interacting the treatment with mean-centered versions of the covariates. Fourth, there must be no measurement error in the covariates or treatment; if there is, the coefficients will be biased in unpredictable directions (though most often downward).
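Continuing the same hypothetical simulation, here is a sketch of that mean-centered interaction approach: when the covariate moderates the treatment effect, interacting the treatment with the mean-centered covariate lets the coefficient on treatment be read as (approximately) the average treatment effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
a = (x + rng.normal(size=n) > 0).astype(float)
y = 2.0 * a + 3.0 * x + 1.5 * a * x + rng.normal(size=n)  # effect of a varies with x

x_c = x - x.mean()                      # mean-center the covariate
X = np.column_stack([a, x_c, a * x_c])  # treatment, covariate, interaction
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params[1])  # coefficient on a, roughly the average treatment effect (about 2 here)
```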
The benefit of other methods like IPW and propensity score matching is that they avoid some of these assumptions or replace them with others. For example, propensity score methods require that you have modeled the probability of treatment correctly. You're trading one modeling assumption (that you have correctly modeled the outcome) for another (that you have correctly modeled the treatment assignment process). You don't need to make assumptions about moderation of the treatment effect, which is one reason to prefer propensity score-based methods. You still need to ensure you've collected and included the right variables and that they are measured without error.
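For comparison, here is a minimal IPW sketch under the same kind of hypothetical setup: instead of modeling the outcome, you model the probability of treatment (here with a logistic regression) and weight each observation by the inverse probability of the treatment it actually received.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(size=n)
a = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(float)  # logistic treatment assignment
y = 2.0 * a + 3.0 * x + rng.normal(size=n)

# Propensity score model: P(A = 1 | X)
ps = sm.Logit(a, sm.add_constant(x)).fit(disp=0).predict()
w = np.where(a == 1, 1 / ps, 1 / (1 - ps))  # inverse-probability weights

# Weighted regression of y on a alone estimates the average treatment effect
ipw = sm.WLS(y, sm.add_constant(a), weights=w).fit()
print(ipw.params[1])  # roughly 2.0
```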
Note that the causal inference field has moved well beyond using linear regression to control for confounding. See my post here for contemporary methods.
For a slightly deeper explanation of how regression works:
When you regress Y on A (treatment) and X (covariates), you get a coefficient for A that can be interpreted as the unique effect of A on Y holding the covariates constant. Another way to see this is the following:
Regress Y on X and take the residuals, R_Y. This is the part of Y that is independent of the covariates X. Now do the same with A: regress A on X (using a linear model, even though the actual assignment model might be nonlinear, e.g., logistic) and take the residuals, R_A. This is the part of A that is independent of the covariates X. You now have two variables, R_Y and R_A, that are completely purged of their (linear) association with the covariates X. If you regress R_Y on R_A, the coefficient you get is exactly equal to the coefficient on A you would get if you regressed Y on A and X. This interpretation hopefully makes it clearer why including covariates in a regression of the outcome on treatment removes the confounding effects of the covariates (again, only under certain very strict assumptions).
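Here is a numerical check of that residual-on-residual argument (this is the Frisch-Waugh-Lovell theorem), again using hypothetical simulated data: the coefficient from regressing R_Y on R_A matches the coefficient on A from the full regression.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=n)
a = (x + rng.normal(size=n) > 0).astype(float)
y = 2.0 * a + 3.0 * x + rng.normal(size=n)

Xc = sm.add_constant(x)
r_y = sm.OLS(y, Xc).fit().resid   # part of y independent of x
r_a = sm.OLS(a, Xc).fit().resid   # part of a independent of x

fwl = sm.OLS(r_y, r_a).fit()      # residual-on-residual regression
full = sm.OLS(y, sm.add_constant(np.column_stack([a, x]))).fit()
print(fwl.params[0], full.params[1])  # identical coefficients
```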
Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13(4), 279–313. https://doi.org/10.1037/a0014268