
In this presentation, Gary King discusses post-treatment bias as follows:

Post-treatment bias occurs:

  • when controlling away the consequences of treatment
  • when causal ordering among predictors is ambiguous or wrong

Example of avoidable post-treatment bias: Causal effect of Race on Salary in a firm

  • DO control for qualifications
  • DON'T control for position in the firm

Example of unavoidable post-treatment bias: Causal effect of democratization on civil war. Do we control for GDP?

  • Yes: since GDP -> democratization, we must control for GDP to avoid omitted-variable bias
  • No: since democratization -> GDP, controlling for GDP would introduce post-treatment bias

I don't understand how this is a bias and not simply a problem of multi-collinearity. In the first example, if black employees tend to get low-ranking positions, then yes, Race and Position are highly correlated, which leads to high standard errors. But why does it lead to bias?

Heisenberg
  • Please clarify your question or accept my answer. As the question and bounty are currently phrased, I directly address the stated question in my answer (especially paragraphs 2 and 7). – Dr. Beeblebrox Jul 29 '15 at 14:15
  • What I'm looking for is how to show the bias mathematically, perhaps with the potential outcomes framework. I'm already clear about the intuition and the definitions of multi-collinearity and bias. What I hope to get is a formal exposition of this sentence: `Post-treatment bias is a problem because one of your control variables will mathematically "soak up" some of the effect of your treatment`. Would it be possible to formally show the `mathematically "soak up"` part? – Heisenberg Jul 29 '15 at 15:06
  • Do you have access to Gelman and Hill, "Data Analysis Using Regression and Multilevel/Hierarchical Models"? The section I suggested, Section 9.7, pages 188-190, contains a formal presentation of bias from controlling for a consequence of treatment. – Dr. Beeblebrox Jul 29 '15 at 18:41
  • @jabberwocky the Gelman and Hill section you suggested is very helpful. I have accepted your answer. I'd suggest, though, that you substantially reframe your answer along the lines of Gelman and Hill's explanation. The present answer goes into many unnecessary details (e.g. what multi-collinearity is, which is already clear from the question) while missing the key point, i.e. explaining post-treatment bias formally (e.g. under the potential outcomes framework) rather than just intuitively (e.g. "soak up variation"). – Heisenberg Jul 30 '15 at 06:02

1 Answer


First, let’s clear up a difference between the terms, and then discuss the respective problems that each causes.

Multi-collinearity refers to a problematic relationship among multiple right-hand-side variables (usually control variables) caused by their being highly correlated, regardless of causal ordering. Post-treatment bias refers to a problematic relationship between your treatment variable and at least one control variable, based on a hypothesized causal ordering. Furthermore, multi-collinearity and post-treatment bias cause different problems if they are not avoided.

Multi-collinearity generally refers to a high correlation among right-hand-side variables (usually two control variables) in a regression model; that correlation is the problem. If a right-hand-side variable and your outcome variable were highly correlated (conditional on the other right-hand-side variables), however, that would not necessarily be a problem; instead, it would be suggestive of a strong relationship that might interest the researcher.

Multi-collinearity between control variables does not affect the reliability of the model overall: we can still reliably interpret the coefficient and standard error on our treatment variable. The downside of multi-collinearity is that we can no longer interpret the coefficients and standard errors on the highly correlated control variables. But if we are strict in conceiving of our regression model as a notional experiment, in which we want to estimate the effect of one treatment (T) on one outcome (Y) and treat the other variables (X) in our model as controls (not as estimable quantities of causal interest), then including highly correlated control variables is fine.
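To see this point concretely, here is a minimal simulation sketch (my own stylized example with made-up coefficients, not something from King or Gelman and Hill): two control variables are almost perfectly correlated, yet the coefficient on the treatment is still recovered, while the coefficients on the collinear controls become noisy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two nearly identical control variables: severe multi-collinearity.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)

# Treatment is independent of the controls; the true treatment effect is 2.
t = rng.normal(size=n)
y = 2.0 * t + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# OLS of y on an intercept, t, x1, x2.
X = np.column_stack([np.ones(n), t, x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

print("treatment coefficient:", coefs[1])  # close to 2: no bias from the collinear controls
print("control coefficients:", coefs[2:])  # individually noisy (inflated standard errors)
```

The treatment estimate is fine; it is only the two collinear control coefficients that become hard to interpret individually.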

Another fact worth keeping in mind is that if two variables are perfectly multicollinear, one of them will be dropped from any regression model that includes both.

For more, see http://en.wikipedia.org/wiki/Multicollinearity

Post-treatment bias occurs when the regression model includes a consequence of treatment as a control variable, regardless of how highly correlated that consequence-of-treatment control variable is with the treatment (although, generally, the severity of post-treatment bias increases with the correlation between the treatment and the consequence-of-treatment control variable).

Post-treatment bias is a problem because one of your control variables will mathematically “soak up” some of the effect of your treatment, thus biasing your estimate of the treatment effect. That is, some of the variation in your outcome due to your treatment will be accounted for in the coefficient estimate on the consequence-of-treatment control variable. This is misleading because to estimate the full effect of treatment, you want all of the variation explained by the treatment to be included in the treatment variable's coefficient estimate.
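To make the "soak up" part formal rather than just verbal, here is a minimal sketch assuming a simple linear data-generating process (the notation $T$, $M$, $Y$ and the coefficients below are my own illustration, not King's or Gelman and Hill's). Let $T$ be the treatment, $M$ a consequence of treatment (a mediator), and $Y$ the outcome, with

$$M = \gamma T + u, \qquad Y = \delta T + \beta M + \varepsilon,$$

where $u$ and $\varepsilon$ are independent of $T$. Substituting the first equation into the second gives

$$Y = (\delta + \beta\gamma)\,T + \beta u + \varepsilon,$$

so the total effect of $T$ on $Y$ is $\delta + \beta\gamma$. Because $\beta u + \varepsilon$ is uncorrelated with $T$, regressing $Y$ on $T$ alone consistently estimates the total effect $\delta + \beta\gamma$. Regressing $Y$ on $T$ and $M$, by contrast, consistently estimates $\delta$ on $T$ and $\beta$ on $M$: the mediated component $\beta\gamma$ moves out of the treatment coefficient and into the coefficient on $M$. That migration is what "soak up" means here, and it is a bias relative to the total-effect estimand; if the direct effect $\delta$ is what you are after, conditioning on $M$ is exactly what you want.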

As an example, suppose we want to study the impact of race on salary. Imagine that race affects job position, which in turn affects salary, and that the full effect of race on salary operates through the way race changes people's job position. That is, apart from how race affects job position, there is no effect of race on salary. If we regressed salary on race and controlled for job position, we would (correctly, mathematically speaking) find no relationship between race and salary conditional on job position.
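A minimal simulation sketch of that scenario (the variable names and coefficients are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

race = rng.binomial(1, 0.5, size=n)            # binary "treatment"
position = 2.0 * race + rng.normal(size=n)     # race affects job position
salary = 5.0 * position + rng.normal(size=n)   # salary depends only on position

def ols(y, *cols):
    """OLS coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(salary, race)[1])            # ~10: total effect of race (2 * 5), entirely mediated by position
print(ols(salary, race, position)[1])  # ~0: position has "soaked up" the whole effect
```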

To highlight how controlling for a consequence of treatment biases your treatment estimate, consider the difference between a researcher interested in the total effect of a treatment and one interested in its direct effect. If we want to study the total impact of race on salary, we do not care how that effect is mediated; we care about all pathways linking race and salary, so we do not want to control for any variable that mediates the effect of race on salary. If instead we care only about the direct effect of race on salary (although this research question smacks of pre-Darwinian scientific racism), we want to exclude any "mediated" effects from our treatment estimate, so we would want to control for job position, education, social networks, and so on. Controlling for these mediators changes the treatment estimate. If our goal is to estimate the direct effect, then controlling for the consequences of treatment is appropriate. If our goal is to estimate the total effect, however, controlling for these consequences of treatment biases our treatment estimate.
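Here is the same kind of sketch (again with invented numbers) for the case where treatment has both a direct and a mediated effect, to show that the two regressions answer different questions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

race = rng.binomial(1, 0.5, size=n)
position = 2.0 * race + rng.normal(size=n)                  # mediated pathway
salary = 1.0 * race + 5.0 * position + rng.normal(size=n)   # direct effect 1, mediated effect 2 * 5 = 10

def ols(y, *cols):
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(ols(salary, race)[1])            # ~11: total effect (1 direct + 10 mediated)
print(ols(salary, race, position)[1])  # ~1: direct effect only; the mediated part is "soaked up" by position
```

Neither number is wrong; they target different estimands. The bias arises only when you report the conditional-on-position estimate while your question is about the total effect.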

For more intuition through example, refer to Gelman and Hill (2007) "Data Analysis Using Regression and Multilevel/Hierarchical Models," pages 188-192.

Dr. Beeblebrox
  • What does "soak up" mean mathematically? How is it different from a high correlation between Race and Position, two right-hand-side variables? – Heisenberg Jul 26 '15 at 22:57
  • @Heisenberg I revised the answer in an attempt to make it clearer. Hopefully it helps. Let me know. – Dr. Beeblebrox Jul 27 '15 at 21:48
  • Thank you for the effort, but the difference is not yet clear. With multi-collinearity, two RHS variables are highly correlated (that's clear). With post-treatment bias, what's the relationship between those two RHS variables? If they are just correlated, then it's no different from multi-collinearity. If they are something other than just being correlated, then what is that relationship mathematically? – Heisenberg Jul 27 '15 at 21:52
  • Think of it this way: suppose we have causal effects that operate like this: $X \rightarrow Y \rightarrow Z$. Thus, the effect of $X$ is that it affects $Y$, which then affects $Z$. If you include $Y$ in your regression model, then the estimated effect of $X$ is the average change in $Z$, given a one unit change in $X$ and *holding $Y$ constant*. But if you leave out $Y$, the estimated effect of $X$ is the average change in $Z$ for a one unit change in $X$ (*allowing* for $Y$ to change with changes in $X$). – Cliff AB Jul 27 '15 at 22:05
  • This is one of the times when thinking of it mathematically can be misleading (i.e. just looking at it mathematically, you will see that $Y$ is more highly correlated with $Z$ than $X$ is). Remember that teasing out causal relations is really a logical, rather than mathematical, challenge. All the raw data in the world doesn't tell you whether you have an observational study or experiment, but just knowing where the data comes from does. – Cliff AB Jul 27 '15 at 22:12
  • @Heisenberg, I think you should focus on paragraph 2 ("Multi-collinearity refers to a problematic relationship...") and paragraph 7 ("Post-treatment bias occurs when the regression model..."). These highlight two differences: (1) multi-collinearity is about correlation and post-treatment bias (PTB) is about hypothesized causal ordering, and (2) multi-collinearity makes your estimates of the correlated control variables unreliable, while post-treatment controls bias your treatment effect estimate. NB: Cliff AB introduced a DAG here -- i.e., PTB depends on theorized causal structure (See Judea Pearl on DAGs). – Dr. Beeblebrox Jul 28 '15 at 02:58
  • You guys make a great point about causal relations being a logical issue. But I'm a little hung up on the fact that OLS is supposed to be unbiased as long as there is no correlation between the RHS variables and the error term (i.e. no endogeneity). So I'm wondering whether we can frame this post-treatment bias in terms of correlation with the error term somehow? In the `X -> Y -> Z` framework, perhaps controlling for `Y` introduces endogeneity? – Heisenberg Jul 28 '15 at 03:04
  • @Heisenberg Yes, under certain conditions, OLS is BLUE: i.e., among unbiased linear estimators, it is the most statistically efficient ("best"). But it's not unbiased in estimating the *true* causal effect you care about. It is only unbiased in estimating whatever you ask for! Wrong question-->wrong answer. To understand the "right question," read the referenced section in Gelman and Hill about why not to control for consequences of treatment. Then read Pearl (2015), "Conditioning on Post-treatment Variables" for a contrarian voice about when you *do* want to control for them. – Dr. Beeblebrox Jul 28 '15 at 03:23
  • @Heisenberg. To make jabberwocky's response a bit more concrete with the race-salary example. (1) If you think firms will employ every employee at a job that is appropriate for their skill level but discriminate in salary, then control for job position and OLS will give you the 'best' test. (2) If you think firms will also discriminate in promotion decisions, don't control for job position and OLS will give you the 'best' test. (3) If you want to test whether 'society' is discriminating, don't control for job position and don't control for qualifications and OLS will give you the 'best' test. – stijn Jul 28 '15 at 04:54
  • @Heisenberg Also, to clarify the Gauss–Markov theorem, OLS is not "supposed to be unbiased." When Gauss–Markov assumptions are met, OLS *is* an unbiased estimator of your model's estimand (the parameter to be estimated), according to the mathematical definition of unbiasedness. For some tasks (e.g., predictive models) you'd actually prefer a biased estimator with lower variance (i.e., overall, lower mean squared error). "OLS is BLUE" is saying, within the constrained universe of linear and unbiased models, OLS is the most statistically efficient of that subset of potential models. – Dr. Beeblebrox Jul 28 '15 at 14:57