I'm aware that there's a lot of disagreement over the possibility of estimating causal relationships using observational data. But say you're broadly sympathetic to the idea of trying to do so (maybe because you're in a field where you can't run experiments, yet need to ask causal questions). What variables should you include (i.e., control for) or exclude when attempting to estimate causal relationships using regression in particular?
Say I’m interested in estimating the causal effect of X on Y. Here’s what I think I have a handle on:
- Unless I control for (i.e., include) all confounders, in the sense of variables that have an effect on both X and Y, I’ll get a biased estimate of the coefficient for X.
- However, I need to be careful not to control for any colliders, since conditioning on one opens a non-causal path between X and Y that would otherwise have remained blocked, and so introduces a spurious association.
- I also want to avoid controlling for mediators, i.e., variables that lie on a causal path from X to Y, since that would block the part of X’s effect that runs through them and leave me estimating only the remaining (direct) effect rather than the total effect of X on Y.
So, if that’s right: control for confounders, but don’t control for colliders and mediators.
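To check that I have these three cases straight, here is a minimal simulation sketch. Everything in it is my own assumption rather than anything from real data: the relationships are linear with Gaussian noise, the effect sizes (1.0, 0.8, 0.7, 0.6, 0.5) are arbitrary, and `ols_coefs` and `noise` are hypothetical helpers I wrote for illustration, built on NumPy's least-squares solver.

```python
import numpy as np


def ols_coefs(y, *regressors):
    """Return OLS coefficients of y on the regressors (intercept first)."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


rng = np.random.default_rng(0)
n = 200_000  # large sample so estimates sit close to their population values


def noise():
    """Draw n independent standard-normal errors."""
    return rng.normal(size=n)


# Case 1: confounder C -> X and C -> Y. Assumed true effect of X on Y: 1.0.
c = noise()
x = 0.8 * c + noise()
y = 1.0 * x + 0.8 * c + noise()
confounder = {"omitted": ols_coefs(y, x)[1], "controlled": ols_coefs(y, x, c)[1]}

# Case 2: mediator X -> M -> Y plus a direct X -> Y path.
# Assumed direct effect: 0.5. Total effect: 0.5 + 0.7 * 0.6 = 0.92.
x = noise()
m = 0.7 * x + noise()
y = 0.5 * x + 0.6 * m + noise()
mediator = {"omitted": ols_coefs(y, x)[1], "controlled": ols_coefs(y, x, m)[1]}

# Case 3: collider X -> K <- Y. Assumed true effect of X on Y: 1.0.
x = noise()
y = 1.0 * x + noise()
k = 0.8 * x + 0.8 * y + noise()
collider = {"omitted": ols_coefs(y, x)[1], "controlled": ols_coefs(y, x, k)[1]}

# Print the coefficient on X with the third variable omitted vs. controlled.
for name, res in [("confounder", confounder), ("mediator", mediator), ("collider", collider)]:
    print(f"{name:10s} omitted={res['omitted']:.2f} controlled={res['controlled']:.2f}")
```

Under these assumptions, I'd expect the coefficient on X to land near the true 1.0 only when the confounder is controlled, near the total effect of 0.92 only when the mediator is omitted (controlling for it should give roughly the direct effect, 0.5), and near 1.0 only when the collider is omitted.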
But here’s what I’m not clear on:
What about a variable - call it Z - that has a causal effect on Y, but not on X, and as such isn’t a confounder? Should such a variable be included? It seems to me that it should, since I’m otherwise asking the coefficient for X to account for any effect of Z on Y, but I might be wrong.
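One way I could probe this is with a small repeated-sampling simulation. The sketch below only covers a narrow case that I am assuming for illustration: a linear model, Gaussian noise, Z drawn independently of X, and arbitrary effect sizes (1.0 for X, 2.0 for Z). `x_coef` is a hypothetical helper, not part of any library. It compares the estimated coefficient on X with and without Z across many samples.

```python
import numpy as np


def x_coef(y, *regressors):
    """Return the OLS coefficient on the first regressor (an intercept is included)."""
    design = np.column_stack([np.ones_like(y), *regressors])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs[1]


rng = np.random.default_rng(1)
n, reps = 500, 2000  # sample size per replication, number of replications

without_z, with_z = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    z = rng.normal(size=n)  # assumed independent of X, so not a confounder
    y = 1.0 * x + 2.0 * z + rng.normal(size=n)  # assumed true effect of X: 1.0
    without_z.append(x_coef(y, x))
    with_z.append(x_coef(y, x, z))

without_z = np.array(without_z)
with_z = np.array(with_z)

# Compare the centre and the spread of the two sets of estimates.
print(f"without Z: mean={without_z.mean():.3f} sd={without_z.std():.3f}")
print(f"with Z:    mean={with_z.mean():.3f} sd={with_z.std():.3f}")
```

In this particular setup, I'd expect both sets of estimates to centre on the true value of 1.0, with the estimates that include Z being less spread out, since Z soaks up residual variance in Y. Whether that carries over when Z is correlated with X, or to nonlinear models, is exactly the kind of thing I'm unsure about.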