
The following webpage says:

We should not control for a collider variable!

Which OLS assumptions are colliders violating?

robertspierre
    NONE! That is the pernicious part about colliders. All of the assumptions of OLS can be satisfied and yet we can still be fooled by colliders. – Demetri Pananos May 22 '21 at 01:06
  • @DemetriPananos If all the assumptions of OLS are satisfied, then the estimate is unbiased. And consistent. Then what is the problem? How can we be fooled? – robertspierre May 22 '21 at 10:08
  • ["T-consistency vs. P-consistency"](https://stats.stackexchange.com/questions/265739) might be helpful. – Richard Hardy May 22 '21 at 14:59
  • Causal effects are not correlational effects. See Chapter 1 of Hernán, M. A., & Robins, J. M. (2020). [*Causal Inference: What If*](https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/). Chapman & Hall/CRC. – Alexis May 22 '21 at 16:24

3 Answers


I will assume models without intercepts to have shorter notation. Say the structural causal model is \begin{aligned} Y&=\beta_1X+u, \\ Z&=\gamma_1X+\gamma_2Y+v, \\ X&=w \end{aligned} with $u,v,w$ being mutually independent zero-mean exogenous structural errors so that $Z$ is a collider: $X\rightarrow Z\leftarrow Y$.

Let us specify a linear regression as $$ Y=\alpha_1X+\alpha_2Z+\varepsilon $$ and get ready to estimate it with OLS. We would wish for $\hat\alpha_1^{OLS}\rightarrow\beta_1$ as $n\rightarrow\infty$. This would be the case if the following two conditions held simultaneously:

  1. $\alpha_1=\beta_1$ and
  2. the relevant OLS assumptions were satisfied.

However, this is not the case. Suppose $\alpha_1=\beta_1$. Then from the structural causal model and the specified regression we get \begin{aligned} \varepsilon&=-\alpha_2Z+u \\ &=-\alpha_2(\gamma_1X+\gamma_2Y+v)+u. \end{aligned} Thus $\varepsilon$ is a linear function of $X$. This violates the assumption $\mathbb{E}(\varepsilon|X)=0$. This assumption is what Wooldridge calls Assumption MLR.4 (Zero Conditional Mean) in "Introductory Econometrics: A Modern Approach". Note that it is specific to the desired causal interpretation of regression parameters; noncausal interpretations (such as regression as a model of the conditional expectation function of $Y|X,Z$) do not require it. Since it is violated, both conditions above cannot hold simultaneously. Therefore, $\beta_1$ cannot be the target to which the OLS estimator of $\alpha_1$ converges.
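This can also be checked numerically. Below is a minimal Python sketch of the argument (the parameter values $\beta_1=2$, $\gamma_1=\gamma_2=1$ and the use of NumPy least squares are illustrative choices, not part of the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
beta1, gamma1, gamma2 = 2.0, 1.0, 1.0  # illustrative structural coefficients

# Structural causal model: u, v, w are mutually independent, zero-mean
u, v, w = rng.normal(size=(3, n))
X = w
Y = beta1 * X + u
Z = gamma1 * X + gamma2 * Y + v  # collider: X -> Z <- Y

# OLS of Y on X alone converges to beta1
b_simple = np.linalg.lstsq(X[:, None], Y, rcond=None)[0][0]

# OLS of Y on (X, Z): the coefficient on X is pulled away from beta1
b_collider = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0][0]

print(round(b_simple, 2))    # close to beta1 = 2.0
print(round(b_collider, 2))  # close to 0.5 here, far from 2.0
```

With these particular values, $Z=3X+u+v$, and solving the population normal equations for the regression of $Y$ on $(X,Z)$ gives a coefficient on $X$ of $0.5$: the estimator is consistent for something, just not for $\beta_1$.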

Richard Hardy
  • I have not really thought this through, but if I am lucky this might happen to be correct. Or it might not. – Richard Hardy May 22 '21 at 16:04
  • This is also very nice (+1) – Robert Long May 22 '21 at 16:21
  • @RobertLong, thank you! It was an improvisation, but hopefully it got us somewhere better than where we (or at least I) started from. – Richard Hardy May 22 '21 at 16:49
  • @RichardHardy Hi Richard, in this example, is the 'imperfect model' a structural equation or a regression one? – markowitz May 22 '21 at 19:20
  • @markowitz, the imperfect model is our model which falsely includes $Z$ and thus does not coincide with the structural causal model that is the truth. – Richard Hardy May 23 '21 at 06:32
  • The problem is that your example is a bit confusing, and your comment does not help. Here terminology matters a lot, and 'model' is an ambiguous name. I suppose it is a regression, and in that case $\alpha_1$ does not identify $\beta_1$; this is fine, and for this reason controlling for the collider $Z$ is a bad idea. The problem is that you purport to use the exogeneity assumption on the regression equation as a supporting argument. This is precisely what is wrong. Indeed, in statistical terms this independence condition is true, not false! On the other hand, if the 'model' is structural, it is inconsistent with the SCM. So you use contradictory assumptions. – markowitz May 23 '21 at 07:30
  • Moreover, the book you refer to is not a good reference. It is quite ambiguous about causal inference. Indeed, assumption MLR.4 is quite problematic therein. Moreover, causal concepts like colliders never appear therein. I analyzed this book carefully. Read the related link in my answer. – markowitz May 23 '21 at 07:35
  • @markowitz, I have read the link in your answer and am aware of the criticism. The assumption is fairly standard nevertheless, and it rules out inclusion of colliders if one seeks the optimality of an OLS regression. I have tried to make a clear distinction between the structural causal model (the causal truth / causal DGP) and our imperfect model of it. If you have a suggestion on how to rephrase it in a clearer way, you are welcome to share it. (I think *model* is a better term than *regression* here because of the [ambiguity of the latter](https://stats.stackexchange.com/questions/173660).) – Richard Hardy May 23 '21 at 09:51
  • @markowitz, however, I find it difficult to understand what you mean by *The problem is that you pretend to use exogenety assumption on reg equation as support argument. This is precisely what is wrong. Indeed in stat terms this ind condition is true not false!* Could you help me with that? – Richard Hardy May 23 '21 at 09:52
  • Unfortunately, the ambiguities run in several directions. Later I will try to give you a more exhaustive answer about the exact meaning of “regression”. For now I will say that it is a synonym for the CEF (as said in the link in my answer). However, a terminology as generic as “model” can hardly help disentangle ambiguities; it probably adds others. About the “clear distinction” you tried to make, I can guess what you have in mind, but it seems to me a bad idea. – markowitz May 23 '21 at 11:06
  • The clear distinction should be between regression equations and structural equations, not something else. From your last comment I understand that the “imperfect model” is a structural equation, and later you estimate a regression equation (OLS) with the same form and notation (not so clear a distinction). The problem is that, following the Pearl literature, you do not have to distinguish between the “true SCM” and a “hypothesized SCM”. You can take the (fully specified) SCM as the “true DGP” and then show all of its implications (theory). – markowitz May 23 '21 at 11:07
  • In a practical context you have limited information; you encode all of (your) causal assumptions in one SCM and work with it and its implications. Or at least this is my understanding of the Pearl literature. Therefore your example shows inconsistent causal assumptions. About the comment of mine that you highlight: it came from my guess that the “imperfect model” stands for the regression; if that is not so, you can forget it. – markowitz May 23 '21 at 11:07
  • @markowitz, thank you, this is easier for me to follow. It has been a while since I read Pearl, and I forget so quickly... So it seems my "imperfect model" is an SCM then. We have the true SCM and the imperfect SCM. I estimate the latter by OLS and show that the OLS estimator of $\alpha_1$ does not converge to the true causal effect $\beta_1$. According to you, there is a fault in my argument. Could you elaborate a bit more on in which sense my causal assumptions are inconsistent? Is it that the true SCM differs from the imperfect SCM? I think this is only natural. – Richard Hardy May 23 '21 at 12:52
  • “Could you elaborate a bit more on in which sense my causal assumptions are inconsistent? Is it that the true SCM differs from the imperfect SCM?” Exactly. In my view you have to consider only the initial SCM; then you can estimate the regression that includes the collider and verify that $\alpha_1$ (the regression coefficient) does not identify $\beta_1$ (the structural/causal coefficient). Moreover, you can show that, quite surprisingly, if you exclude $Z$ (the collider) from the regression, the new regression coefficient (say $\alpha_3$) identifies $\beta_1$. That's all there is to the example. – markowitz May 23 '21 at 13:03
  • That said, I'm not sure this example is what the OP is looking for. Indeed, I refrained from writing down an example because, as I said in my first comment on this discussion, the OP's question can open the door to very deep problems that do not permit the kind of short and exhaustive answer that a short example suggests. – markowitz May 23 '21 at 13:05
  • @markowitz, I think I do what you say I should be doing, except that it is impossible to consider only the true SCM and estimate a regression that differs from it (because then we *are* considering another model, different from the true SCM, and estimating it; there can be no regression without a model). Due to the impossibility, I define the imperfect SCM and essentially do what you say. I also consistently maintain that the true SCM is true and the imperfect SCM is our model of the truth. I condition on truth where appropriate to obtain the results that I establish. I think this is correct. – Richard Hardy May 23 '21 at 14:08
  • @markowitz, on the other hand, it would be interesting to see your answer implementing what you yourself suggest. Then we would be able to see more clearly if there are any differences between what you suggest and what I did. It would be easier to discuss with some formulas and assumptions written down in a clear sequence. – Richard Hardy May 23 '21 at 14:10
  • Ok, I will follow your suggestion later, adding the example to my answer. But for now, note that "there can be no regression without a model (SCM)" is not true. You can compute all possible regressions on the data you have, regardless of any SCM. The problem is another one: without an SCM it is hard to justify which regression can help you solve your causal query. – markowitz May 23 '21 at 21:35
  • @markowitz, OK. Yes, one can compute all possible regressions, but one cannot interpret them without having an SCM in mind, this is what I meant. – Richard Hardy May 24 '21 at 05:18
  • I edited my answer. – markowitz May 24 '21 at 06:20
  • @markowitz, I took a quick look and could not find where your answer differs from mine, aside from not showing why $\theta_2$ does not identify $\beta_1$ (quite a crucial omission, in my opinion), which my answer does show, while yours explains what an SCM is in greater detail and includes an equation for $X$ (a welcome addition). But let me finish the working day and take a closer look. – Richard Hardy May 24 '21 at 06:50
  • I added some details about “why” $Z$ is not a good control. I see that you added some detail about structural errors, and that is good. However, as I already said, my main point is that you do not need two SCMs (true and hypothetical); you need just one. Given it, you can check the capability of all possible regressions for parameter identification; the erroneous use of the collider as a control variable emerges from that too. You mixed assumptions about the “true SCM” and the “hypothetical SCM” in order to “prove” the erroneous use of the collider. – markowitz May 24 '21 at 14:27
  • This idea probably comes from the use of the “true model” and the “misspecified model” in several econometrics books. However, to my knowledge, this logic is not used in the Pearl literature, which is what seems needed here. I'm not a master of all that, but your strategy sounds problematic to me. Finally, I cannot understand how you can consider Wooldridge 7ed a good reference for a problem like this. That said, feel free to maintain your position, and thanks for the discussion. – markowitz May 24 '21 at 14:27
  • @markowitz, I appreciate your input and your patient and thorough help on the way towards a better understanding of the phenomena of interest. I will continue thinking and possibly refining my answer to address the concerns that I can understand and appreciate. Thank you sincerely for your help so far. – Richard Hardy May 24 '21 at 15:04
  • I have thought a bit more and see that (at least part of) your criticism is that I am not doing the analysis in Pearl's way. I suppose you are correct about that. However, that does not mean Pearl's type of analysis is the only that can be correct. Nor does it prove my take is correct, of course. I am using a mixture of Pearl and traditional econometrics *a la* Wooldridge. I am perhaps more comfortable with the latter but have included Pearl's SCM at the beginning to clarify what the causal truth is; I find Pearl's approach helpful there. After that I proceed in a more traditional 'metric way. – Richard Hardy May 24 '21 at 18:38
  • I guess this may be frowned upon from both sides, regardless of whether it is correct or not. Now, the reason I used the non-Pearl approach is that I wanted to illustrate, using the traditional approach, how colliders violate traditional assumptions. I could probably have refrained from referring to the SCM altogether to stay true to traditional econometrics, but that would have required additional thought on how to phrase the SCM unambiguously. I did not do everything in Pearl's approach since it may be less transparent and intuitive for those only exposed to traditional econometrics. – Richard Hardy May 24 '21 at 18:52
  • First of all, it is a pleasure for me to talk with you about econometrics. “I am using a mixture of Pearl and traditional econometrics a la Wooldridge. … Now, regarding why I used the non-Pearl approach is that I wanted to illustrate using the traditional approach how colliders violate traditional assumptions.” I have an econometrics background and I share your aspiration. For some time now I have spent effort in this direction. – markowitz May 24 '21 at 19:49
  • However, today my hope of saving “traditional econometrics” is close to zero. Surely some language translation can be made, but I think that substantial acquisition from the literature of Pearl and his colleagues is unavoidable for the future of econometrics. Indeed, the fact that no econometrics professor has yet given a reliable reply to the Chen and Pearl (2013) article is bad news for supporters of “classical econometrics”. – markowitz May 24 '21 at 19:49
  • That said, if you or someone else achieves reliable results, I will be happy! Let me know. Now, about your revised reply: the problems with the SCM go away, but unfortunately others come from the regression side. These are what I spoke about at the start of our discussion: you purport to use the exogeneity assumption on the regression equation. The decomposition that you made seems correct to me but, if without loss of generality we assume that the data come from a joint Normal, the condition $E[\varepsilon|X] \neq 0$ is false! It is only apparent. – markowitz May 24 '21 at 19:50

It is very easy to demonstrate that all the assumptions of OLS can be satisfied and yet collider bias persists.

Here, I generate data in which $z$ is a collider for the effect of $x$ on $y$.

library(tidyverse)

# Simulate 1000 data sets; in each, z is a collider on the path between x and y
r = rerun(1000, {
  w = rnorm(100)                     # cause of x (and of z)
  u = rnorm(100)                     # unobserved cause of y (and of z)
  z = 3*u - w + rnorm(100, 0, 0.5)  # collider: u -> z <- w
  x = 2*w + rnorm(100, 0, 0.3)
  y = 5*x - u + rnorm(100, 0, 0.75) # true causal effect of x on y is 5

  mod1 = lm(y ~ x + w)  # adjusts for w; back-door path stays blocked
  mod2 = lm(y ~ x + z)  # conditions on the collider z

  tibble(`No Collider` = coef(mod1)['x'], `Collider` = coef(mod2)['x'])
}) %>% 
  bind_rows()

Note all the assumptions of linear regression are satisfied:

i) Observations are iid
ii) The functional form is correct
iii) Homogeneity of variance, and
iv) The likelihood is normal (though this is not as important, hence its place last...)

Plotting the 1000 replications of this experiment, we find that model 1 (which correctly "closes the back door" by adjusting for $w$) provides an unbiased estimate of the effect of $x$ on $y$. However, model 2 (which conditions on the collider) has a systematic bias, resulting in an estimated effect of $x$ on $y$ that is smaller than the truth.

[Figure: sampling distributions of the estimated coefficient on $x$ for the two models across the 1000 replications]

EDIT:

1)

If the functional form is correct and the model is strictly exogenous, we can prove that the estimate for $\beta$ must be unbiased, that is, $E(\hat{\beta}) = \beta$.

The coefficients of the model are unbiased estimates, sure, but the question becomes: unbiased estimates of what? Whatever they are, they are not unbiased estimates of the causal effect of $x$ on $y$.

2)

I do not think observations are iid is an OLS assumption

You are correct. The assumptions I've listed here would be the assumptions of a Gaussian GLM, which are stricter than those of OLS.

Also, did you mean homogeneity (if so, what does it mean?)

I did mean homogeneity, but I meant homogeneity of variance, not of errors. I've fixed that. Homogeneity of variance is a simpler way of saying (or spelling) homoskedasticity.

3)

Can controlling for a collider "fool" us? Is it wrong to control for a collider? If so, why? Let's start from there

Yes, it can. This example demonstrates this. The real effect of changing x by one unit is 5. The first model (y regressed on x and w, thereby keeping all back-door paths between y and x blocked) shows an unbiased estimate of 5. The model controlling for the collider produces an estimate of x's effect on y which is systematically lower than 5.

The "why" of colliders is still a bit of a mystery to me. In the readings I've done, authors just say "the flow of information is blocked by a collider, but conditioning on the collider opens the back door" or something in that spirit. If you find a satisfactory explanation for why the collider bias happens, let me know.
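For what it's worth, a small simulation at least makes the mechanism visible: two independent causes of a common effect become (negatively) associated once we condition on that effect. A deliberately simplified sketch in Python (here $z = w + u$, not the exact model in the answer):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two independent causes and their common effect (the collider)
w = rng.normal(size=n)
u = rng.normal(size=n)
z = w + u

# Marginally, w and u are uncorrelated
print(round(np.corrcoef(w, u)[0, 1], 2))  # approximately 0

# Within a narrow slice of z -- i.e., conditioning on the collider --
# w and u are strongly negatively correlated: with z held fixed,
# a high w forces a low u
mask = np.abs(z) < 0.1
print(round(np.corrcoef(w[mask], u[mask])[0, 1], 2))  # near -1
```

Once $z$ is (approximately) fixed, knowing $w$ pins down $u$; that induced association between the causes is what leaks into the coefficient on $x$ when the collider sits in the regression.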

4)

I don't think your model is well specified. In the population y is a function of x and u. Yet you are only controlling for x

What if u is some expensive measurement, or one we forgot to collect? We can't collect data on everything that affects the outcome. That being said, you're right to be suspicious of this. There are more formal ways of checking that the model you've written down is consistent with the data, which involve checking conditional independences. You can find ways to test these implications here under "Testable Implications".

Demetri Pananos
  • What do you mean by "functional form is correct"? – cure May 22 '21 at 01:56
  • @cure The true expectation of the outcome is $E(y) = X\beta$ – Demetri Pananos May 22 '21 at 07:01
  • Honestly I'm not following your proof. If the functional form is correct and the model is strictly exogenous, we can prove that the estimate for $\beta$ must be unbiased, that is, $E(\hat{\beta}) = \beta$. So if all the assumptions of the OLS model are satisfied, I'm not sure where this bias comes from. – robertspierre May 22 '21 at 07:58
  • I do not think observations are iid is an OLS assumption, even though it is satisfied in this simulation. Random sampling is what is needed. Also, did you mean homogeneity (if so, what does it mean?) or homoskedasticity? Also, you may want to include no perfect multicollinearity into the list of assumptions. Exogeneity would be another assumption needed for causal interpretation though not for probabilistic optimality of OLS. – Richard Hardy May 22 '21 at 09:03
  • @RichardHardy For unbiasedness we need only (1) functional form is correct (2) strict exogeneity. But let's assume that all the OLS assumptions you list are satisfied. Can controlling for a collider "fool" us? Is it wrong to control for a collider? If so, why? Let's start from there :) – robertspierre May 22 '21 at 10:11
  • @DemetriPananos upon a more accurate read, I don't think your model is well specified. In the population `y` is a function of `x` and `u`. Yet you are only controlling for `x` – robertspierre May 22 '21 at 10:17
  • @robertspierre, it was not specified whether you care about all optimality properties of OLS or just a subset of these like unbiasedness, so I went for the whole package. – Richard Hardy May 22 '21 at 10:32
  • @robertspierre See my edits – Demetri Pananos May 22 '21 at 13:52
  • @robertspierre; your question can open up problems greater than you realize at the beginning. First of all, you speak about “exogeneity” as an OLS assumption, and indeed several authoritative books do the same, but this is wrong. Exogeneity is a causal concept/assumption. Yours is the most common confusion between causal and statistical concepts. Read my answer and the references therein – markowitz May 22 '21 at 14:31
  • (+1) nice one @DemetriPananos . Similar things happen when looking at bias due to confounding, mediation, differential selection etc. The OLS estimates can of course be unbiased *for that particular model* but the problem is that the model is mis-specified if we wish to estimate the total causal effect of some exposure on an outcome. Of course if we want just direct effects (eg in mediation), or we only care about prediction (eg many machine learning models) then much of this hardly matters. – Robert Long May 22 '21 at 14:41
  • Thank you for addressing my concerns. I find *homoskedasticity* preferable to *homogeneity of variance*. The former is unambiguous in a regression context while I am still not sure what homogeneity of variance technically means. If it is exactly the same as homoskedasticity, why invent a new term when there is a well-established one? If it means something else, what is it? Regarding assumptions of a Gaussian GLM, is iid observations really an assumption? It implies Xs are iid, so we are not allowed to set them at will as we would in a controlled experiment; is that so? – Richard Hardy May 22 '21 at 14:56
  • @RobertLong, good points. ["T-consistency vs. P-consistency"](https://stats.stackexchange.com/questions/265739) is a related post. – Richard Hardy May 22 '21 at 14:59
  • @RichardHardy I'm not interested in arguing semantics. It means the same as homoskedasticity, and it may or may not be popular in other circles. It seems to be sufficient in my circles, if it isn't in yours then you are free not to use it. IID observations for the gaussian glm is an assumption. It means that the observations of the *outcome* are independent of one another. The X are considered fixed and we make no assumptions about them or their distribution (except perhaps they are observed without error, which is always wrong). – Demetri Pananos May 22 '21 at 15:03
  • @robertspierre, model 2 violates the exogeneity assumption (which I mentioned in a comment above) which is invoked for establishing the consistency of OLS for the causal parameter rather than the "predictive" parameter of the conditional expectation function. – Richard Hardy May 22 '21 at 15:04
  • @DemetriPananos, I did not mean to argue. I am seeking clarity for improving my own understanding and hopefully assisting others at the same time. Now, if Xs are allowed to be fixed, then Ys are not iid because their conditional expectations vary with Xs (unless Xs are all equal to the same constant, which is a degenerate case). Thus I think it would be OK to say that either Y|X are iid or $\varepsilon$ are iid, but not that observations are iid, because neither Y nor X are iid, and observations are pairs (Y,X). – Richard Hardy May 22 '21 at 15:09
  • @RichardHardy You are correct in saying the observations conditional on the predictors are iid, and that is indeed the assumption. The language of regression generates confusion for many people, across many disciplines, new and old. – Demetri Pananos May 22 '21 at 15:12
  • Thank you. This is exactly why I am interested in precise wording. Imprecise wording leads all too easily to confusion and false claims. I do agree that language differences may generate confusion for people from different fields, but imprecise wording will generate confusion even inside a field. – Richard Hardy May 22 '21 at 15:13
  • "conditioning on the collider opens the back door"—that's exactly right. It's easy to see on a causal diagram. Consider A and B cause C and D, and also C causes D. We want to know the effect of A on D. If we don't control for C, then we can estimate P(D | A), and all is well. But if we control for C, then there is information flow down from A through C and up to B, and back down to D—and this information is nowhere in the model. This is why controlling for too many variables is bad. – Neil G May 24 '21 at 06:58
  • (Off-topic, I will never cease to be amazed by Hadley Wickham's ability to think up new names for things which already exist in R!) – Flounderer May 24 '21 at 09:22
  • Please consider using base R, & commenting it extensively, when illustrating posts here w/ R code. Not everyone who will come to this page will be familiar w/ R, & not all of those will be able to read tidy-code. – gung - Reinstate Monica May 24 '21 at 14:44

The problem here is that "collider" is a causal concept, while OLS regression does not necessarily deal with causality. About "regression and causality", read here: Under which assumptions a regression can be interpreted causally?

If we view OLS regression as an estimator of the linear CEF, colliders and other causal problems do not matter. Read here: Regression and the CEF

Moreover, unfortunately, several books are ambiguous, if not erroneous, about the meaning of regression, especially about its possible causal use (read here: How would econometricians answer the objections and recommendations raised by Chen and Pearl (2013)?)

EDIT: following the discussion with Richard Hardy, I add here the same example, revised from my perspective:

The structural causal model (SCM) is \begin{aligned} Y&=\beta_1X+u_Y, \\ Z&=\beta_2X+\beta_3Y+u_Z \\ X&=u_X \end{aligned} so that $Z$ is a collider: $X\rightarrow Z\leftarrow Y$.

The structural errors can be considered the exogenous variables in the system; we assume them to be zero-mean and mutually independent. Note that one implication of this is: $E[u_Y|X]=0$, $E[u_Z|X,Y]=0$. Note that, in general, the SCM explicitly encodes all causal assumptions made by the researcher.

Now, we are interested in the causal effect of $X$ on $Y$, so we are looking for the regression equation that permits us to identify $\beta_1$; note that this is the direct causal effect of interest and, in this particular case, also the total effect (by assumption).

The reply is quite simple: in the regression

$Y=\theta_1X+r_1$

$\theta_1$ identifies $\beta_1$.

In general, however, for identification of the causal effect of interest the regression above is not what we need ($\theta_1$ would not identify the effect of interest), and we have to add some control variables. Now the original question becomes, more or less: why is controlling for a collider not a good idea?

In our example we can try to add the collider as a control and compute the regression

$Y=\theta_2X+\theta_3Z+r_2$

but $\theta_2$ does NOT identify $\beta_1$. This is because admissible control sets have to comply with the backdoor criterion; $\{Z\}$ is not among them, while the empty set is. So including $Z$ (the collider) is a bad idea. Worse, this regression does not identify any causal effect implied by the SCM. Indeed, not all regressions can help in causal inference.
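These identification claims can be checked numerically; here is a sketch in Python (the coefficient values $\beta_1=1.5$, $\beta_2=1$, $\beta_3=2$ are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta1, beta2, beta3 = 1.5, 1.0, 2.0  # illustrative structural coefficients

# The SCM above: structural errors are zero-mean and mutually independent
u_X, u_Y, u_Z = rng.normal(size=(3, n))
X = u_X
Y = beta1 * X + u_Y
Z = beta2 * X + beta3 * Y + u_Z  # collider: X -> Z <- Y

# Regression Y = theta1*X + r1: theta1 identifies beta1
theta1 = np.linalg.lstsq(X[:, None], Y, rcond=None)[0][0]

# Regression Y = theta2*X + theta3*Z + r2: theta2 does NOT identify beta1
theta2 = np.linalg.lstsq(np.column_stack([X, Z]), Y, rcond=None)[0][0]

print(round(theta1, 2))  # close to beta1 = 1.5
print(round(theta2, 2))  # close to -0.1 here: even the sign is wrong
```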

For other examples in the same fashion, you can see:

Infer one link of a causal structure, from observations

Endogenous controls in linear regression - Alternative approach?

That said, I don't know if this example is what the asker is looking for. The problem is deeper: do the so-called "OLS assumptions" play some role above?

This can be a matter of debate. I have written a lot about that on this site: see the links above, and the links therein. However, my short answer is: NO. This is because "OLS assumptions", wherever presented, do not include any clear causal assumptions.

markowitz
  • (+1) This is very interesting. I think I follow where you differ with @RichardHardy, but some of that is in comments to his answer. So for completeness, and the benefit of others, would you be able to summarise, in your answer, where you disagree with Richard? – Robert Long May 24 '21 at 14:34
  • @Robert Long; my main point was that we do not need two SCMs (true and hypothetical); we need just one; see the comments for more details. However, Richard revised his reply in the direction that I suggested … even if I still have other concerns (read the comments). – markowitz May 24 '21 at 19:57