What variables to include/exclude when estimating causal relationships using regression

Question

I'm aware that there's a lot of disagreement over the possibility of estimating causal relationships using observational data. But say you're broadly sympathetic to the idea of trying to do so (maybe because you're in a field where you can't run experiments, yet need to ask causal questions). What variables should you include (i.e., control for) or exclude when attempting to estimate causal relationships using regression in particular?

Say I’m interested in estimating the causal effect of X on Y. Here’s what I think I have a handle on:

Unless I control for (i.e., include) any confounders, in the sense of variables that have an effect on both X and Y, I’ll mis-estimate the coefficient for X.
However, I need to be careful not to control for any colliders, and in so doing introduce spurious, non-causal relationships, potentially opening up back-door paths between X and Y that otherwise would’ve remained blocked.
I also want to avoid controlling for mediators, i.e., variables that lie on a causal path from X to Y, since that would block the part of X's effect that flows through them and leave me mis-estimating the total effect of X on Y.

So, if that’s right: control for confounders, but don’t control for colliders and mediators.

But here’s what I’m not clear on:

What about a variable - call it Z - that has a causal effect on Y, but not on X, and as such isn’t a confounder? Should such a variable be included? It seems to me that it should, since I’m otherwise asking the coefficient for X to account for any effect of Z on Y, but I might be wrong.

gung - Reinstate Monica · Accepted Answer · 2020-01-29T18:54:50.073

Your basic understanding is correct: You need to control for all relevant confounders, but not for other variables correlated with X for other reasons (i.e., colliders and mediators). As far as I'm aware, this isn't controversial—it has been proven mathematically. However, I do think this is much more difficult in practice than people seem to believe.
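
To make this concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the variable names (`c` for a confounder, `k` for a collider), the coefficients, the sample size, and the helper `ols_coefs` are all assumptions, not part of the question. The true effect of `x` on `y` is set to 1.0, and the same data are fit three ways.

```python
import numpy as np

def ols_coefs(y, *regressors):
    """Return OLS coefficients (intercept first) of y on the given regressors."""
    design = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 100_000
true_effect = 1.0  # assumed causal effect of x on y

c = rng.normal(size=n)                               # confounder: causes both x and y
x = 0.8 * c + rng.normal(size=n)
y = true_effect * x + 1.5 * c + rng.normal(size=n)
k = x + y + rng.normal(size=n)                       # collider: caused by both x and y

b_naive = ols_coefs(y, x)[1]           # confounder omitted
b_adjusted = ols_coefs(y, x, c)[1]     # confounder included
b_collider = ols_coefs(y, x, c, k)[1]  # confounder and collider included

print(f"omit confounder:       {b_naive:.3f}")
print(f"control confounder:    {b_adjusted:.3f}")
print(f"also control collider: {b_collider:.3f}")
```

In this particular setup, only the second fit should land near the true effect of 1.0. Omitting the confounder pushes the estimate upward, and adding the collider pulls it far away from 1.0 again even though the confounder is still in the model.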

After that, controlling for other variables that are uncorrelated with X (such as the Z you describe) is the same as in any other multiple regression model. Namely, if those variables make a substantial contribution to Y, controlling for them will reduce the residual variance and give you a clearer picture of the X $\rightarrow$ Y relationship (i.e., a more precise estimate and greater power). By contrast, controlling for wholly irrelevant variables will reduce your residual degrees of freedom and thereby reduce your power (albeit typically not by much if you have a reasonable sample size relative to the number of included variables). For a fuller discussion of this topic, it may help to read my answer here: How can adding a 2nd IV make the 1st IV significant?
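
A small sketch of that precision argument, again with invented numbers: `z` plays the role of a variable that affects Y but is independent of X, `junk` is a wholly irrelevant variable, and the helper `ols_with_se` (a hypothetical name) computes the usual OLS standard errors under homoskedastic errors.

```python
import numpy as np

def ols_with_se(y, *regressors):
    """Return (coefficients, standard errors) from OLS, intercept first."""
    design = np.column_stack([np.ones_like(y), *regressors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    dof = len(y) - design.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof            # estimated residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)       # affects y, independent of x (not a confounder)
junk = rng.normal(size=n)    # affects nothing
y = 0.3 * x + 2.0 * z + rng.normal(size=n)  # assumed true effect of x is 0.3

beta_x_only, se_x_only = ols_with_se(y, x)
beta_with_z, se_with_z = ols_with_se(y, x, z)
beta_with_junk, se_with_junk = ols_with_se(y, x, z, junk)

print(f"y ~ x:            b = {beta_x_only[1]:.3f}, se = {se_x_only[1]:.3f}")
print(f"y ~ x + z:        b = {beta_with_z[1]:.3f}, se = {se_with_z[1]:.3f}")
print(f"y ~ x + z + junk: b = {beta_with_junk[1]:.3f}, se = {se_with_junk[1]:.3f}")
```

Because `z` is independent of `x`, leaving it out does not bias the coefficient on `x`, but including it should shrink the standard error noticeably. Adding `junk` should leave the standard error essentially unchanged at this sample size, since it costs only one degree of freedom.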
