
First, don't panic. Yes, there are many similar questions on this site, but I believe none gives a conclusive answer to the question below. Please bear with me.


Consider a data generation process $\text{D}_X(x_1, ... , x_n|\theta)$, where $\text{D}_X(\cdot)$ is a joint density function, with $n$ variables and parameter set $\theta$.

It is well known that a regression of the form $x_n = f(x_1, ... , x_{n-1}|\theta)$ is estimating a conditional mean of the joint distribution, namely, $\text{E}(x_n|x_1,...,x_{n-1})$. In the specific case of a linear regression, we have something like

$$ x_n = \theta_0 + \theta_1 x_1 + ... + \theta_{n-1}x_{n-1} + \epsilon $$
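To fix ideas, here is a minimal simulation (illustrative only; all coefficients are invented) of the point that a regression estimates a conditional mean, and that a conditional mean exists in either direction:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy DGP with invented coefficients: E(x_n | x_1) = 2 + 3*x_1.
x1 = rng.normal(size=n)
xn = 2 + 3 * x1 + rng.normal(size=n)

# OLS of x_n on x_1 recovers the conditional-mean coefficients.
X = np.column_stack([np.ones(n), x1])
print(np.linalg.lstsq(X, xn, rcond=None)[0])   # ~ [2, 3]

# The reverse regression just as happily recovers E(x_1 | x_n): a
# conditional mean exists in both directions, so estimating one says
# nothing by itself about the direction of causality.
Xr = np.column_stack([np.ones(n), xn])
print(np.linalg.lstsq(Xr, x1, rcond=None)[0])  # ~ [-0.6, 0.3]
```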

The question is: under which assumptions of the DGP $\text{D}_X(\cdot)$ can we infer the regression (linear or not) represents a causal relationship?

It is well known that experimental data does allow for such an interpretation. From what I can read elsewhere, it seems the condition required on the DGP is exogeneity:

$$ \text{E}(x_1, ... x_{n-1}|\epsilon) = 0$$

The nature of the randomisation involved in experimental data ensures the above is the case. The story then goes on to argue why observational data normally fails to achieve this condition, for reasons including omitted variable bias, reverse causality, self-selection, measurement error, and so on.

I am however uncertain about this condition. It seems too weak to encompass all potential arguments against regression implying causality. Hence my question above.

UPDATE: I am not assuming any causal structure within my DGP. I am assuming that the DGP is complete, in the sense that there must be some causality (an ontological position that could well be debated), and that all relevant variables are included. The key is to identify the set of assumptions which ensures that causality goes from certain variables to the others, without assuming such a direction of causality from the outset.


Many similar posts on the site spend time explaining why correlation does not imply causation, without providing hard arguments about when it does. That is the case, for instance, with this very popular post. Additionally, in the most popular post on the site about the topic, the accepted answer gives the very vague answer:

Expose all hidden variables and you have causation.

I do not know how to translate that to my question above. Neither does the second most upvoted answer help. And so on. That is why I believe this post does not have an answer elsewhere.

luchonacho
  • Hm... I think that exogeneity per se is quite inefficient. My understanding is closer to this: the RCT (or at least an approximation of it) balances observed and unobserved confounders. (It is also relevant to mention positivity, in the form of common support, but I find this "second" to unconfoundedness.) – usεr11852 Oct 22 '20 at 21:12
  • @usεr11852 What do you mean by inefficient? That there is a weaker set of assumptions that yields causal interpretation? – luchonacho Oct 22 '20 at 21:41
  • Apologies, I wanted to type insufficient and probably it was auto-corrected to inefficient. Apologies for any confusion caused. My comment is that I don't think exogeneity on its own is enough. :) – usεr11852 Oct 22 '20 at 22:23
  • Aside: Linear regression does not encode a direction of causality but there is an asymmetry between the dependent and independent variables unless one is using total least squares. Using linear least squares amounts to a presumption that there is no error to be found in the values of the IV, i.e. only error in the DV should make up the residuals. In other words the researcher is assuming that their measurements of the IV are "perfect" or determined by them with complete precision, while any measurement or model error lives in the DV – benxyzzy Oct 23 '20 at 06:02
  • You may be interested in http://www2.nber.org/WNE/WNEnotes.pdf and specifically Assumption 1 (Unconfoundedness) in that PDF, as well as https://stats.stackexchange.com/questions/182222/unconfoundedness-in-rubins-causal-model-laymans-explanation. – Adrian Oct 24 '20 at 23:40
  • Do you want to interpret all $n-1$ parameters causally? – dimitriy Oct 27 '20 at 16:26
  • @DimitriyV.Masterov Good question. In principle, yes. Do you ask because [mean dependence](https://en.wikipedia.org/wiki/Mean_dependence) can be used to relax the assumptions? – luchonacho Oct 27 '20 at 17:47
  • @Christoph The point is to find the most agnostic set of assumptions on the DGP that ensures a regression can be interpreted causally. Is it exogeneity? That is one claim often made. A mathematical proof would be a great proof; I cannot think of another type. – luchonacho Oct 27 '20 at 18:19
  • @Christoph We are used to thinking in terms of regression models (conditional mean models), but these are always part of a DGP with a joint distribution. The point of referring to the DGP is to get to the core of the "problem" and think in terms of the assumptions on the DGP that allow us to rule out alternative directions of causality in a regression, which are always a risk. – luchonacho Oct 27 '20 at 19:24
  • To me, this was all not clear ;-) – Christoph Oct 28 '20 at 07:50
  • My first intuition when reading the question was that unless you have interventions, you can't. Forget it, full stop. After reading all the sophisticated and intelligent answers with which I have no issues, I still think the straight answer is just, "you can't". – Christian Hennig Oct 31 '20 at 10:59
  • @Lewian, even with an intervention you have to assume that all the other variables that could be affecting the dependent variable are fixed while you are intervening. Otherwise, something else might have happened simultaneously with your intervention that could mask the effect of the intervention. So in my view, causality is always coming from an assumption. Of course, in an experimental setting you can make the assumption more plausible than otherwise, but strictly speaking the assumption is unavoidable. (What if there is an evil spirit that affects $Y$ whenever you are intervening onto $X$?) – Richard Hardy Oct 31 '20 at 11:10
  • @Richard if we are reading Lewian's comment in its narrow sense, then he is strictly saying that intervention is a *necessary* condition and not a *sufficient* condition. – Sextus Empiricus Oct 31 '20 at 15:32
  • Wow, the response has been amazing. I don't know what to do, really! I will have to think hard, or assign another bounty afterwards. – luchonacho Nov 02 '20 at 23:59

6 Answers


I have made efforts in this direction and I feel obliged to give an answer. I have written several answers and questions about this topic; some of them can probably help you. Among others:

Regression and causality in econometrics

conditional and interventional expectation

linear causal model

Structural equation and causal model in economics

regression and causation

What is the relationship between minimizing prediction error versus parameter estimation error?

Difference Between Simultaneous Equation Model and Structural Equation Model

endogenous regressor and correlation

Random Sampling: Weak and Strong Exogenity

Conditional probability and causality

OLS Assumption-No correlation should be there between error term and independent variable and error term and dependent variable

Does homoscedasticity imply that the regressor variables and the errors are uncorrelated?

So, here:

Regression and Causation: A Critical Examination of Six Econometrics Textbooks - Chen and Pearl (2013)

the reply to your question

Under which assumptions a regression can be interpreted causally?

is given. However, at least in Pearl's opinion, the question is not well posed: as a matter of fact, some points must be fixed before it can be answered directly. Moreover, the language used by Pearl and his colleagues is not (yet) familiar in econometrics.

If you are looking for the econometrics book that gives the best reply, I have already done this work for you. I suggest: Mostly Harmless Econometrics: An Empiricist's Companion - Angrist and Pischke (2009). However, Pearl and his colleagues do not consider that presentation exhaustive either.

So let me try to answer in as concise, but also complete, a way as possible.

Consider a data generation process $\text{D}_X(x_1, ... , x_n|\theta)$, where $\text{D}_X(\cdot)$ is a joint density function, with $n$ variables and parameter set $\theta$. It is well known that a regression of the form $x_n = f(x_1, ... , x_{n-1}|\theta)$ is estimating a conditional mean of the joint distribution, namely, $\text{E}(x_n|x_1,...,x_{n-1})$. In the specific case of a linear regression, we have something like $$ x_n = \theta_0 + \theta_1 x_1 + ... + \theta_{n-1}x_{n-1} + \epsilon $$
The question is: under which assumptions of the DGP $\text{D}_X(\cdot)$ can we infer the regression (linear or not) represents a causal relationship? ... UPDATE: I am not assuming any causal structure within my DGP.

The core of the problem is precisely here. All the assumptions you invoke involve purely statistical information only, and in that case there is no way to reach causal conclusions, at least not in a coherent and unambiguous manner. In your reasoning, the DGP is presented as a tool that carries the same information that can be encoded in the joint probability distribution, no more (the two are used as synonyms). The key point is that, as underscored many times by Pearl, causal assumptions cannot be encoded in a joint probability distribution or in any statistical concept completely attributable to it. The root of the problem is that the joint probability distribution, and conditioning rules in particular, work well for observational problems but cannot properly handle interventional ones. Now, intervention is the core of causality. Causal assumptions have to stay outside the distributional aspects. Most econometrics books fall into confusion/ambiguity/error about causality because the tools presented there do not permit a clear distinction between causal and statistical concepts.

We need something else in which to pose causal assumptions. The Structural Causal Model (SCM) is the alternative proposed in the causal inference literature by Pearl. So the DGP must be precisely the causal mechanism we are interested in, and our SCM encodes all we know/assume about the DGP. Read here for more detail about the DGP and the SCM in causal inference: What's the DGP in causal inference?

Now: you, like most econometrics books, rightly invoke exogeneity, which is a causal concept:

I am however uncertain about this condition [exogeneity]. It seems too weak to encompass all potential arguments against regression implying causality. Hence my question above.

I understand your perplexity about that well. Many problems do indeed revolve around the exogeneity condition. It is crucial, and it can be enough in a quite general sense, but it must be used properly. Follow me.

The exogeneity condition must be written on a structural-causal equation (and its error), and on nothing else. Surely not on something like the population regression (a genuine concept, but the wrong one here), and not on any kind of "true model/DGP" that has no clear causal meaning. For example, not on an absurd concept like the "true regression" used in some presentations. Vague/ambiguous concepts like "linear model" are also used a lot, but they are not adequate here.

No statistical condition, however sophisticated, is enough if the above requirement is violated: weak/strict/strong exogeneity, predeterminedness, past/present/future, orthogonality/uncorrelatedness/independence/mean independence/conditional independence, stochastic or non-stochastic regressors, etc. None of these, nor any related concept, is enough if it refers to an error/equation/model that has no causal meaning from the start. You need a structural-causal equation.

Now, you, like some econometrics books, invoke things such as experiments, randomization and related concepts. This is one right way. However, it can be used improperly, as in the Stock and Watson manual (if you want, I can give details). Angrist and Pischke also refer to experiments, but they introduce a structural-causal concept at the core of their reasoning as well (the linear causal model; chapter 3, p. 44). Moreover, as far as I have checked, they are the only ones who introduce the concept of bad controls. This story sounds like the omitted variables problem, but here not only a correlation condition but also a causal nexus (p. 51) is invoked.

Now, there exists in the literature a debate between "structuralists vs experimentalists". In Pearl's opinion this debate is rhetorical: briefly, for him the structural approach is more general and powerful, and the experimental one boils down to the structural one. Indeed, structural equations can be viewed as a language for encoding a set of hypothetical experiments.

That said, the direct answer. If the equation

$$ x_n = \theta_0 + \theta_1 x_1 + ... + \theta_{n-1}x_{n-1} + \epsilon $$

is a linear causal model like here: linear causal model

and the exogeneity condition $$ \text{E}[\epsilon |x_1, ... x_{n-1}] = 0$$ holds,

then a linear regression like

$$ x_n = \beta_0 + \beta_1 x_1 + ... + \beta_{n-1}x_{n-1} + v $$

has causal meaning. Better: the $\beta$s identify the $\theta$s, and the $\theta$s have a clear causal meaning (see note 3).

In Angrist and Pischke's opinion, models like the above are old-fashioned. They prefer to distinguish between causal variable(s) (usually only one) and control variables (read: Undergraduate Econometrics Instruction: Through Our Classes, Darkly - Angrist and Pischke 2017). If you select the right set of controls, you achieve a causal meaning for the causal parameter. In order to select the right controls, for Angrist and Pischke, you have to avoid bad controls. The same idea is used in the structural approach, where it is well formalized in the back-door criterion [reply in: Chen and Pearl (2013)]. For some details on this criterion read here: Causal effect by back-door and front-door adjustments

In conclusion: all the above says that a linear regression estimated with OLS, if properly used, can be enough for the identification of causal effects. Econometrics and other fields also present other estimators with strong links to regression, such as IV (instrumental variables) estimators; they too can help identify causal effects, and indeed they were designed for this. However, the story above still holds: if the problems above are not solved, the same problems, or related ones, carry over to IV and/or other techniques.

Note 1: I noted from the comments that you asked something like: "Do I have to define the directionality of causation?" Yes, you must. This is a key causal assumption and a key property of structural-causal equations. On the experimental side, you have to be well aware of which variable is the treatment and which is the outcome.

Note 2:

So essentially, the point is whether a coefficient represents a deep parameter or not, something which can never ever be deduced from (that is, it is not assured alone by) exogeneity assumptions but only from theory. Is that a fair interpretation? The answer to the question would then be "trivial" (which is ok): it can when theory tells you so. Whether such parameter can be estimated consistently or not, that is an entirely different matter. Consistency does not imply causality. In that sense, exogeneity alone is never enough.

I fear that your question and answer come from misunderstandings, which in turn come from the conflation of causal and purely statistical concepts. I am not surprised by that because, unfortunately, this conflation is made in many econometrics books, and it represents a tremendous mistake in the econometrics literature.

As I said above and in the comments, most of the mistake comes from an ambiguous and/or erroneous definition of the DGP (= true model). The ambiguous and/or erroneous definition of exogeneity is a consequence, and ambiguous and/or erroneous conclusions about the question follow from that. As I said in the comments, the weak points of doubled's and Dimitriy V. Masterov's answers come from these problems.

I started to face these problems years ago, beginning with the question: "Does exogeneity imply causality? Or not? If yes, what form of exogeneity is needed?" I consulted at least a dozen books (the most widespread included) and many other presentations/articles on these points. There were many similarities among them (obviously), but finding two presentations that share precisely the same definitions/assumptions/conclusions was almost impossible. From them, it sometimes seemed that exogeneity was enough for causality, sometimes not, sometimes it depended on the form of exogeneity, and sometimes nothing was said. In summary, even though something like exogeneity was used everywhere, the positions ranged from "regression never implies causality" to "regression implies causality". I feared that there were some short circuits, but only when I encountered the article cited above, Chen and Pearl (2013), and the Pearl literature more generally, did I realize that my fears were well founded. I am an econometrics lover and felt disappointment when I realized this fact. Read here for more about that: How would econometricians answer the objections and recommendations raised by Chen and Pearl (2013)?

Now, the exogeneity condition is something like $E[\epsilon|X]=0$, but its meaning depends crucially on $\epsilon$. What is it?

The worst position is that it represents something like a "population regression error/residual" (DGP = population regression). If linearity is imposed as well, this condition is useless; if not, this condition imposes a linearity restriction on the regression, no more. No causal conclusions are permitted. Read here: Regression and the CEF

Another position, still the most widespread, is that $\epsilon$ is something like a "true error", but the ambiguity of the DGP/true model is shared there too. Here lies the fog: in many cases almost nothing is said, but the usual common ground is that it is a "statistical model" or simply a "model". From that, exogeneity implies unbiasedness/consistency; no more. No causal conclusion, as you said, can be deduced. Then causal conclusions come from "theory" (economic theory), as you and some books suggest. In this situation causal conclusions can arrive only at the end of the story, and they are founded on something like a foggy "expert judgement"; no more. This seems to me an unsustainable position for econometric theory. This situation is inevitable if, as you (implicitly) said, exogeneity stays on the statistical side, and economic theory (or other fields) on another.

We must change perspective. Exogeneity is, also historically, a causal concept and, as I said above, must be a causal assumption and not just a statistical one. Economic theory is also expressed in terms of exogeneity; they go together. In different words: the assumptions you are looking for, which permit causal conclusions from a regression, cannot live in the regression itself. These assumptions must live outside it, in a structural causal model. You need two objects, not just one. The structural causal model stands for the theoretical-causal assumptions; exogeneity is among them and is needed for identification. The regression stands for estimation (under other, purely statistical, assumptions). Sometimes the econometric literature does not distinguish clearly between regression and true model either; sometimes the distinction is made but the role of the true model (or DGP) is not clear. This is where the conflation between causal and statistical assumptions comes from; first of all, an ambiguous role for exogeneity.

The exogeneity condition must be written on the structural causal error. Formally, in Pearl's language (we need the formality), the exogeneity condition can be written as

$$E[\epsilon \mid do(X)]=0,$$

which implies the identifiability condition

$$E[Y \mid do(X)]=E[Y \mid X].$$

In this sense exogeneity implies causality.
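To illustrate the difference the $do(\cdot)$ makes (a simulation sketch of my own, with invented structural coefficients, not from the cited references):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

def dgp(x=None):
    """Structural model: Y = 2*X + U + noise, with confounder U -> X.
    Passing x simulates the intervention do(X = x)."""
    u = rng.normal(size=n)
    if x is None:
        X = u + rng.normal(size=n)       # observational: X listens to U
    else:
        X = np.full(n, float(x))         # do(X = x): X is set from outside
    Y = 2 * X + u + rng.normal(size=n)
    return X, Y

# Interventional mean E[Y | do(X=1)] = 2, the structural coefficient:
_, y_do = dgp(x=1.0)
print(y_do.mean())                        # ~ 2.0

# Conditional mean E[Y | X close to 1] from observational data is larger,
# because conditioning on X is also informative about the confounder U:
X, Y = dgp()
print(Y[np.abs(X - 1.0) < 0.05].mean())   # ~ 2.5: E[Y|do(X)] != E[Y|X], not identified
```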

Read also here: Random Sampling: Weak and Strong Exogenity

Moreover, some of the points above are treated in this article: Trygve Haavelmo and the Emergence of Causal Calculus – Pearl (2015).

For some takeaways about causality in linear models, read here: Linear Models: A Useful "Microscope" for Causal Analysis - Pearl (2013)

For an accessible presentation of the Pearl literature, read this book: Judea Pearl, Madelyn Glymour, Nicholas P. Jewell - Causal Inference in Statistics: A Primer http://bayes.cs.ucla.edu/PRIMER/

Note 3: More precisely, it is necessary to say that the $\theta$s surely represent the so-called direct causal effects, but without additional assumptions it is not possible to say whether they represent the total causal effects too. Obviously, if there is confusion about causality at all, it is not possible to address this second-round distinction.

markowitz
  • *all $\beta$s represent unbiased/consistent estimators*: estimators are functions of the sample data, while your $\beta$s do not seem to be such. – Richard Hardy Oct 27 '20 at 19:55
  • My explanation is already quite long; I skipped many details and maybe some imprecisions remain. I modified that part using the concept of "identification". Now it sounds better. – markowitz Oct 28 '20 at 08:56
  • "Indeed, even if at general level I agree with his explanation, I fear that the answer of Sextus Empiricus have here his weak point" What does this sentence mean? – Sextus Empiricus Oct 31 '20 at 14:53
  • @SextusEmpiricus; It has more or less the same meaning as my last comments on your answer. I noted that you had modified it, and I carefully reread all you have written. Your last part is important, and only there do you talk about exogeneity explicitly. My doubts are about this point. I read this part several times and I may have misinterpreted it, but it seems to me that you consider the definition of $\epsilon$, and therefore that of the DGP, and exogeneity as unrelated things. Therefore it seems that you deny any causal role for exogeneity. This position seems wrong to me. – markowitz Oct 31 '20 at 15:16
  • My point: it seems to me that you deny any causal role for exogeneity, and so consider it a purely statistical assumption. Is that so? – markowitz Oct 31 '20 at 15:23
  • I considered two cases for $\epsilon$ because the OP is not so clear about $\epsilon$. In one case it can relate to the variability in a statistical model; in the other case it can relate to the variability in a 'true' model (mechanistic model, DGP, whatever you wish to call it). With my answer I did not mean to deny anything, and I am open to whatever interpretation is given to $\epsilon$. – Sextus Empiricus Oct 31 '20 at 15:43
  • "I considered two cases for $\epsilon$ because the OP is not so clear about $\epsilon$." I agree with your first case but, as I already said, it seems to me that the OP's question is clear enough and falls in this case. Then you wrote "With my answer I did not mean to deny anything [causal role for exogeneity] and I am open to whatever interpretation is given to $\epsilon$ … [and] … I am saying that I agree that this exogeneity condition is useful (but I do not see it as a sufficient assumption to imply causality)"; square brackets are mine. – markowitz Oct 31 '20 at 18:17
  • Ok, this is enough for me to delete the reference to your answer in mine. However, your position about exogeneity seems unclear to me (note that the same happens in many presentations/books), and clarity about the relation between causality and regression suffers from that. – markowitz Oct 31 '20 at 18:17
  • Wow, that is a very lengthy answer! Perhaps you might want to add a TL;DR (if that thing is possible here) for the busy reader. – luchonacho Oct 31 '20 at 23:10
  • @luchonacho; My answer is quite long, it's true. However, the fact of the matter is that your question is far from easy and very slippery; I think that is why several masters have gotten confused about it. I tried to touch on the most important things. For the busy reader I suggest skipping the notes, comments and references. However, I would say that spending a few more minutes here is not a waste of time if you want to bridge the gap between usual econometrics and modern causal inference. – markowitz Nov 01 '20 at 01:28
  • Indeed, if you want, I can give you the short answer that people familiar with (Pearl's) causal inference can give us: "a regression carries causal meaning if the backdoor criterion is satisfied". This answer is correct, but it can sound obscure to many people, especially if they come from the econometrics side. – markowitz Nov 01 '20 at 08:30
  • Ok, so let's wrap this up. You are essentially saying that people use the assumption "exogeneity" with laxness, whereas you say it should always be in the context of a structural model (theory-based) and thus it does imply causality. That is, what some economists term as exogeneity (conditional mean of error = 0) should not be called so unless the error is from a structural model. The conclusion is then that without knowledge of the structural model we cannot deduce, by mere statistical assumption on whatever error we have, that conditional mean = 0 implies causality. Is that a decent summary? – luchonacho Nov 03 '20 at 20:20
  • Only one little clarification: rather than "…thus it [exogeneity] does imply causality", I prefer to say: "... then exogeneity is one causal assumption, among others, needed for the identification of causal effects with regression". The rest seems completely OK to me. It seems to me that you got it! – markowitz Nov 03 '20 at 21:15

Here's a partial answer for when the underlying model is actually linear. Suppose that the true underlying model is $$Y = \alpha + \beta X + v.$$

I'm making no assumptions about $v$, though we have that $\beta$ is THE effect of $X$ on $Y$. The linear regression coefficient, which we will denote $\tilde{\beta}$, is simply a statistical relationship between $Y$ and $X$, and we have $$\tilde{\beta} = \frac{cov(Y,X)}{var(X)}.$$

So one 'cheap' answer (which you've already mentioned) is that a linear regression identifies a causal effect when the covariance corresponds to a causal effect and not just a statistical relationship. But let's try to do a bit better.

Focusing on the covariance, we have \begin{align*} cov(Y,X) & = cov(\alpha + \beta X + v, X)\\ & = \beta cov(X,X) + cov(v,X) \\ & = \beta var(X) + cov(v,X), \end{align*}

and so dividing by the variance of $X$, we get that $$ \tilde{\beta} = \beta + \frac{cov(v,X)}{var(X)}.$$

We need $cov(v,X) = 0$ for $\tilde{\beta} = \beta$. We know that $$cov(v,X) = E[vX] - E[v]E[X],$$ and we need that to be zero, which is true if and only if $E[vX] = E[v]E[X]$, which is true if and only if $v$ and $X$ are uncorrelated. A sufficient condition for this is mean independence similar to what you wrote: i.e. that $E[X|v] = E[X]$, so that $E[vX] = E[E[X|v]v] = E[X]E[v]$ (alternatively, you could let $v' = v - E[v]$ and require $E[v'|X]= 0$ so that $E[v'X] - E[v']E[X] = 0$, which is typically done in regression analysis). All the 'intuitive' language you quote from other posts amounts to various ways of thinking concretely about such assumptions holding in application. Depending on the field, the terms, concepts and approaches will all differ, but they are all trying to get these kinds of assumptions to hold.
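As a quick numerical check of this decomposition (a sketch of mine with arbitrary numbers, not part of the original derivation):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
alpha, beta = 0.7, 1.5

# Build an error v that is deliberately correlated with X via a common shock.
common = rng.normal(size=n)
X = common + rng.normal(size=n)
v = 0.8 * common + rng.normal(size=n)
Y = alpha + beta * X + v

# The population regression coefficient cov(Y, X) / var(X)...
beta_tilde = np.cov(Y, X)[0, 1] / X.var()

# ...equals beta plus the endogeneity term cov(v, X) / var(X):
print(beta_tilde)                           # ~ 1.9
print(beta + np.cov(v, X)[0, 1] / X.var())  # same number, as the identity predicts
```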

Your comment also made me realize that it's important to really stress my assumption of "the true underlying model." I am defining $Y$ as I did. In many situations, we may not know what $Y$ is, and depending on the field, this is precisely why things get 'less rigorous' in some sense: you are no longer taking the model specification itself for granted. In some fields, such as causal inference in statistics, you could think of these issues using DAGs or the idea of d-separation. In others, such as economics, you could start with a model of how individuals or firms behave and back out a true model through that approach, and so on.

As a final side note, in this case the conditional mean independence assumption is stronger than what you need (you 'just' need the covariance to be zero). This stems from the fact that I specified a linear relationship, but it should be intuitive that imposing less structure on the model and departing from a linear regression will need stronger assumptions, closer to the notion of the error term being mean independent (or fully independent) of $X$, for you to get a causal effect (which also becomes trickier to define... one approach could be to think of the partial derivative of $Y$ with respect to $X$).

doubled
  • Thanks. Just to give one test of your solution: how does reverse causality violate your sufficient condition? In other words, where is Y causing X ruled out in such an assumption? – luchonacho Oct 22 '20 at 19:24
  • That's a great point (I didn't fully think about it when I wrote this), and it's being lumped into the statement at the start that I'm considering a "true underlying model." $Y$ is not some other variable; it is precisely the random variable defined as $\alpha + \beta X + v$. So if $Y$ were rain and $X$ were using an umbrella, in the true state of the world it would (presumably!) have to be that $\beta = 0$. Clearly, if you went out and collected data $(Y,X)$ there'd be correlation, but that's due to some time component that's missing, or weather channel predictions, etc., which all reside in $v$. – doubled Oct 22 '20 at 19:47
  • @luchonacho I also updated a quick paragraph at the end based on your comment... I think a lot of the "looser" terms you encounter are when you depart from assuming a true underlying model, or how to build this underlying true model. – doubled Oct 22 '20 at 19:53
  • I purposely did not state the conditional expectation to be a true causal relation. That would be assuming the consequent. My proposal is more humble. Let's assume a joint distribution of variables, a DGP, without saying **anything** about the causal structure of it. We know you can produce any regression (conditional expectation) from it. It could be $x_5$ on all the rest, for instance. The question is, with such an agnostic approach, what is enough to imply causality in one particular regression? Explicitly assuming Y does not cause X is not the idea. Anyone can assume that of their data. – luchonacho Oct 22 '20 at 21:25
  • Let me elaborate. We can distinguish two things. One, a consistently estimated regression. AFAIK, any conditional mean from a well-defined joint distribution can be estimated consistently. That is, you can consistently estimate the relation between ice cream sales and weather, just as you can do for the inverse (assuming the two are the only elements in the real joint distribution). Two, a consistently estimated regression which represents a causal structure (most likely from weather to ice cream sales). The question is, which assumption can precisely differentiate between these two options? – luchonacho Oct 22 '20 at 21:36
  • @doubled; your explanation looks like what I encountered in some econometrics books. It seems to me that the main problem revolves around your definition of DGP = "true underlying model": clear causal meaning is absent. Your conclusion then revolves around $Cov(X,v)$, sometimes/elsewhere presented as a weak form of exogeneity. If the true model is not causal your explanation still holds, but the coefficients do not have causal meaning. In fact the same story is presented in many places, but authors sometimes say that causal interpretation is permitted and sometimes not. Why? The main problem is the meaning of the DGP. – markowitz Oct 28 '20 at 12:46
  • @luchonacho if the DGP is a structural causal model, reverse causality problems go away by construction. Then you say "Let's assume a joint distribution of variables, a DGP, without saying anything about the causal structure of it … The question is, with such an agnostic approach, what is enough to imply causality in one particular regression?" With these premises, as said in my answer, causal conclusions cannot be reached properly. – markowitz Oct 28 '20 at 12:48
  • "One, a consistently estimated regression. AFAIK, any conditional mean from a well defined joint distribution can be estimated consistently." Exactly, and if you remain focused only on "regressions", no causal conclusion can be reached properly. You need a causal model (read Carlos Cinelli's answer here: https://stats.stackexchange.com/questions/211008/dox-operator-meaning). – markowitz Oct 28 '20 at 12:48
  • Just to avoid another source of ambiguity: some people think that "conditional mean" and "regression" are different things, at least sometimes. No, they are synonyms. For "linear regression" the linear-approximation argument can/needs to be invoked, but "conditional mean" and "regression" remain synonyms. Read here: https://stats.stackexchange.com/questions/481056/regressions-population-parameters/481082#481082 – markowitz Oct 28 '20 at 13:42
  • @markowitz So essentially, the point is whether a coefficient represents a deep parameter or not, something which can never ever be deduced from (that is, it is not assured alone by) exogeneity assumptions but only from theory. Is that a fair interpretation? – luchonacho Oct 29 '20 at 21:38
  • @markowitz The answer to the question would then be "trivial" (which is ok): it can when theory tells you so. Whether such parameter can be **estimated** consistently or not, that is an entirely different matter. Consistency does not imply causality. In that sense, exogeneity alone is never enough. – luchonacho Oct 29 '20 at 21:42
  • @luchonacho; I fear that your question and answer come from misunderstandings that cannot truly be resolved in a short answer. I added a large note to my answer; I hope it is exhaustive. – markowitz Oct 30 '20 at 09:40
  • Sorry for the slow answer, I missed this. I certainly agree that my answer is at best a partial answer (hence my prefacing it as such), and especially given the focus on causality without imposing it on the DGP, it wasn't helpful to assume away the problem. Part of my goal was to illustrate that purely statistical concepts may not be able to answer this question (hence my second-to-last paragraph), and though I tried to write a nicer answer, I never found the time, and your answer is certainly far nicer than what I would have written. – doubled Nov 15 '20 at 16:03

The question is: under which assumptions of the DGP $\text{D}_X(\cdot)$ can we infer the regression (linear or not) represents a causal relationship?

It is well known that experimental data does allow for such an interpretation. From what I can read elsewhere, it seems the condition required on the DGP is exogeneity:

$$ \text{E}(x_1, ... x_{n-1}|\epsilon) = 0$$

Regression by itself cannot be interpreted causally. Indeed, 'correlation ≠ causation'. You can see this with the correlated data in the image below. The image is symmetric (the pairs $x,y$ follow a bivariate normal distribution), and regression does not tell whether $Y$ is caused by $X$ or vice versa.

The regression model can be interpreted as representing a causal relationship when the causality is explicitly part of the related data generating process. This is for instance the case when the experimenter performs an experiment where a variable is controlled/changed by the experimenter (and the rest is kept the same, or assumed to be the same), such as a 'treatment study', or in an observational study when we assume there is an 'instrumental variable'.

So it is explicit assumptions about causality in the DGP that make a regression relate to a causal relationship, not situations where the data follow a certain relationship like $\text{E}(x_1, ... x_{n-1}|\epsilon) = 0$.

[image: symmetric scatter plot of the bivariate normal sample]

About the condition $\text{E}(x_1, ... x_{n-1}|\epsilon) = 0$

I believe this should be $\text{E}(\epsilon | x_1, ... x_{n-1}) = 0$. The $\text{E}(x_1, ... x_{n-1}|\epsilon) = 0$ is already easily violated when all $x_i>0$, or if you use standardized data then it is violated when there's heteroscedasticity. Or maybe you switched the meaning of X|Y as conditional on X instead of conditional on Y?

The condition on its own does not ensure that your regression model is to be interpreted causally. In the above example (the image) you can use a regression $x_1 = x_2 +\epsilon$ or $x_2 = x_1 +\epsilon$, and in both cases the condition is true (can be assumed to be true), but that does not make either a causal relationship; at least one (possibly both) of the two regressions cannot be interpreted causally.
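To see this concretely, here is a small simulation (my own sketch, not part of the original answer): for a symmetric bivariate normal sample, the binned conditional mean of the residuals is flat at zero for both regression directions, so the condition cannot single out a causal direction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Bivariate normal pair with correlation 0.6; the joint law treats the two
# variables symmetrically, with no variable marked as the "cause".
cov = [[1.0, 0.6], [0.6, 1.0]]
x1, x2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

def binned_resid_means(y, x, k=10):
    """Fit y = a + b*x by OLS, then estimate E[residual | x] within k bins of x."""
    b = np.cov(y, x)[0, 1] / x.var()
    a = y.mean() - b * x.mean()
    e = y - a - b * x
    edges = np.quantile(x, np.linspace(0, 1, k + 1))
    idx = np.digitize(x, edges[1:-1])
    return np.round([e[idx == j].mean() for j in range(k)], 3)

print(binned_resid_means(x2, x1))  # ~ all zeros: condition holds for x2 = f(x1) + eps
print(binned_resid_means(x1, x2))  # ~ all zeros: and equally for x1 = f(x2) + eps
```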

It is the assumption of the linear model as causal that is the key factor in assuring you that the regression model can be interpreted causally. The condition is necessary when you wish to ensure that the estimate of a parameter in the linear model relates entirely to the causal model and not partially to the noise and confounding variables as well. So yes, this condition is related to an interpretation of regression as a causal model, but the interpretation starts with an explicit assumption of a causal mechanism in the data generating process.

The condition is more related to ensuring that the causal effect (whose size is unknown) is properly estimated by an ordinary least squares regression (ensuring there is no bias); it is not a sufficient condition that turns a regression into a causal model.

Maybe the $\epsilon$ refers to some true error in a theoretical/mechanistic/ab-initio model (e.g. some specific random process that creates the noise term, like dice rolls, particle counts in radiation, vibration of molecules, etc.)? Then the question might be a bit semantic. If you are defining an $\epsilon$ that is the true error in a linear model, then you are implicitly defining the statistical model as equal to the model that is the data generating process. Then it is not really the exogeneity condition that makes the linear regression interpretable causally, but instead the implicit definition/interpretation of $\epsilon$.

Sextus Empiricus
  • Thanks. Are you saying that exogeneity is not enough to deduce causality because it does not imply a causal structure? Would exogeneity + "theory telling us our regression represents a causal process" be sufficient? Is that the weakest set of assumptions, in the sense that 1) you are after a causal relation, and 2) you estimated it correctly? – luchonacho Oct 28 '20 at 12:41
  • @luchonacho I am saying that causality is never deduced, but it is already assumed from the start and relates to how the experiment is set up. – Sextus Empiricus Oct 28 '20 at 13:43
  • Regression is just a way to *measure* the size of an assumed effect, and indeed one might get a non-zero effect size when the independent variables correlate somehow with the error term. But regression never allows to assume causality. It is the design of the experiment (some sort of intervention, some sort of control of the independent variables) that allows a regression to be used to test a causal relationship. – Sextus Empiricus Oct 28 '20 at 13:55
  • I would like to second the general premise of "you can never deduce causality from data alone". If you gave an alien who knows nothing about earth/humans some data on humans which contained two binary columns: "does this individual smoke" and "did they get lung cancer", because you give them no further information such as "the smoking occurred before the cancer", they wouldn't be able to deduce which caused which. You have to make further assumptions/incorporate other knowledge. – gazza89 Oct 30 '20 at 14:00
  • @Sextus Empiricus; At a general level I agree with your explanation. Only one note: in the last part you wrote "The question might be a bit semantic. What is $\epsilon$ and what are the $x_i$ …". Interpretation problems like this are very slippery. Indeed, I am convinced that problems like this lie at the root of the ambiguity that leads some people to conflate the "error" and the residuals. – markowitz Oct 31 '20 at 08:55
  • For example read here: https://stats.stackexchange.com/questions/483598/zero-conditional-expectation-of-error-in-ols-regression/483612#483612 and https://stats.stackexchange.com/questions/477978/endogeneity-testing-using-correlation-test/478091#478091. If the structural causal equation and its structural error, on one side, and the regression equation and its residual, on the other, are well separated, the problems go away much more easily. – markowitz Oct 31 '20 at 08:55
  • @SextusEmpiricus; My comment was made only to say that "semantic points" are precisely what can lead to mistakes; no more. Your other comments make this suspicion even stronger. So, yes, I know what luchonacho refers to with $\epsilon$: a regression error that comes from a DGP without any causal meaning. Luckily he was unambiguous about that. Unfortunately, in this situation it is not possible to reach causal conclusions. Even you, rightly, said something like that. – markowitz Oct 31 '20 at 10:31
  • Then you said: "In that second case, by defining the $\epsilon$ as error in some linear model one restricts the type of DGP one talks about. Then it is not the exogeneity condition that makes that the regression can be interpreted causally, but it is the problem definition that does it. So that idea, I hope it comes across, is what I was referring to with semantic." This sounds wrong to me. – markowitz Oct 31 '20 at 10:32
  • Exogeneity, more than $\epsilon$ per se, is a restriction on the DGP. However, the DGP must have causal meaning too, as you rightly said above. Now, exogeneity is not something related directly to regression or to other purely statistical concepts not linked to a causal DGP. Exogeneity is (and/or must come back to being, clearly) a causal concept, and it poses restrictions on the causal DGP, restrictions needed for identification. In this sense exogeneity is needed for causality. – markowitz Oct 31 '20 at 10:33
  • *the experimenter performs an experiment where a parameter is controlled by the experimenter*. Instead of parameter, did you mean values of a variable (one of the $X$s)? And what do you mean by *ensuring that the causal effect (whose effect size is unknown) is properly determined by ordinary least squares regression*? Should the estimation method (OLS) be involved at this stage? – Richard Hardy Oct 31 '20 at 17:08
  • @RichardHardy *"ensuring that...properly determined by ... regression?"* I am imagining here that the goal of an experiment is to determine the unknown size of some (assumed) causal effect. To do this one can perform a linear regression on some acquired data (if the model in the DGP is also like that), but then one must be sure that, due to the way the experiment is performed, the regressor is not indirectly correlating with the error term. With this, I am saying that I agree that this exogeneity condition is useful (but I do not see it as sufficient assumption to imply causality). – Sextus Empiricus Oct 31 '20 at 17:21
  • Thanks. I feel like using the term *determine* here can be misleading. $Y$ being [causally] determined by $X$ is one thing, the size of an unknown effect or coefficient being found out by estimation is another thing. Whether we use OLS, maximum likelihood or something else is yet another thing; in my experience, conflation of identification and estimation (or model and estimation method as in the often mentioned *OLS regression*) is a hurdle in these kinds of debates. – Richard Hardy Oct 31 '20 at 17:32
  • @RichardHardy 'determine' should have been 'estimate' – Sextus Empiricus Oct 31 '20 at 17:40
  • @SextusEmpiricus; More precisely (sorry if I move up and down with comments, but I don't know where it is better): "Then it is not really the exogeneity condition that makes that the linear regression can be interpreted causally, but instead the implicit definition of $\epsilon$". This part is unclear to me. – markowitz Oct 31 '20 at 18:22
  • @SextusEmpiricus; Your rhetorical question "Does the assumption of exogeneity make that the linear model is to be interpreted causally, or does the assumption that the linear model is to be interpreted causally allow us to make the assumption of exogeneity?" comes from a misunderstanding; probably the unclearness of the last part of your explanation is related. You said: "if you have a DGP that is equivalent to a linear model, then you have exogeneity." No, this is wrong. A DGP (= structural linear causal model) does not imply exogeneity. – markowitz Oct 31 '20 at 21:14
  • As I already said, exogeneity is a restriction on the DGP; nevertheless, regardless of the exogeneity assumption, the DGP always maintains its causal meaning, by definition. If you read my explanation carefully, you can see that the apparent circularity is resolved, and the exogeneity condition clearly plays its role. – markowitz Oct 31 '20 at 21:14
  • If exogeneity is excluded, the example given in my answer becomes the example that you are looking for. Just one warning: structural causal equations (the DGP), linear or not, carry causal/interventional meanings. So, in my example, the expectation involved in the exogeneity condition can be read as an "interventional expectation". In that case the causal parameters are not identifiable, and the regression coefficients in the same example become biased for the causal ones. – markowitz Nov 01 '20 at 01:12
  • I'm not sure what you mean by "explicit practical example"; if you mean a real-data example, I cannot give you a short answer here. If you open a new question I can try with more elaboration. Staying with the simplest possible theoretical case: if you have a structural (causal linear) equation like $y=\theta_0 + x \theta_1 + \epsilon$, the exogeneity condition $E[\epsilon |x]=0$ is usually not satisfied for observational data; you need controls and/or more structure. In the experimental case ($y$ = outcome; $x$ = treatment) the exogeneity condition (on the structural equation!) is reliable. – markowitz Nov 01 '20 at 08:47
  • So, only in the latter case does the associated regression $y=\beta_0 + x \beta_1 + v$ identify the causal effect of interest ($\theta_1$). Note: I know, at least in the simplest cases these are no new "concepts" for us. What is new is the perspective. – markowitz Nov 01 '20 at 08:47
  • @SextusEmpiricus; No, you are wrong. $\epsilon$ is precisely the structural error in the structural causal model (SCM), and the SCM "is" precisely the DGP, or at least the kind of DGP we need for proper causal inference. I cannot say anything more to convince you of that fact, other than to suggest you read the references that I gave and/or Pearl's literature, and that of his colleagues, more generally. – markowitz Nov 01 '20 at 09:17
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/115725/discussion-between-sextus-empiricus-and-markowitz). – Sextus Empiricus Nov 01 '20 at 09:46

Let the true DGP (to be defined below) be

$$y=\mathbf{X}\beta + \mathbf{z}\alpha + \mathbf{v},$$

where $\mathbf{X}$ and $\mathbf{z}$ are regressors, and $\mathbf{z}$ is an $n \times 1$ vector for simplicity (you can think of it as an index of many variables if that feels restrictive). $\mathbf{v}$ is uncorrelated with $\mathbf{X}$ and $\mathbf{z}$.

If $z$ is left out of the OLS model,

$$\hat \beta_{OLS} = \beta + (N^{-1}\mathbf{X}'\mathbf{X})^{-1}(N^{-1}\mathbf{X}'\mathbf{z})\alpha+(N^{-1}\mathbf{X}'\mathbf{X})^{-1}(N^{-1}\mathbf{X}'\mathbf{v}).$$

Under the no-correlation assumption, the third term has a $\mathbf{plim}$ of zero, but $$\mathbf{plim}\hat \beta_{OLS}=\beta + \mathbf{plim} \left[ (N^{-1}\mathbf{X}'\mathbf{X})^{-1}(N^{-1}\mathbf{X}'\mathbf{z}) \right] \alpha.$$

If $\alpha$ is zero or $\mathbf{plim} \left[ (N^{-1}\mathbf{X}'\mathbf{X})^{-1}(N^{-1}\mathbf{X}'\mathbf{z}) \right] = 0$, then $\beta$ can be interpreted causally. In general, the inconsistency can be positive or negative.
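A numerical check of this plim expression (an illustrative sketch of mine; the coefficients and the correlation structure are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 1_000_000
beta = np.array([1.0, -2.0])
alpha = 0.5

# First column of X is correlated with the omitted z; the noise is independent.
z = rng.normal(size=N)
X = np.column_stack([z + rng.normal(size=N), rng.normal(size=N)])
y = X @ beta + z * alpha + rng.normal(size=N)

# OLS leaving z out of the model:
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# The term (X'X)^{-1}(X'z) * alpha predicts the inconsistency:
bias = np.linalg.solve(X.T @ X, X.T @ z) * alpha
print(beta_ols)     # ~ [1.25, -2.0]
print(beta + bias)  # matches: only the coefficient on the z-correlated column is off
```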

So you need to get the functional form right, and include all variables that matter and are correlated with the regressors of interest.

There is another nice example here.


I think this might be a good example to give some intuition about when parameters can have a causal interpretation. It lays bare what it means to have a true DGP or to have the functional form right.

Let's say we have a SEM/DGP like this:

$$y_1 = \gamma_1 + \beta_1 y_2 + u_1,\quad 0<\beta_1 <1, \quad y_2=y_1+z_1$$

Here we have two endogenous variables (the $y$s), a single exogenous variable $z_1$, a random unobserved disturbance $u_1$, a stochastic relationship linking the two $y$s, and a definitional identity linking the three variables. We also have an inequality constraint to avoid dividing by zero below. The variation in $z_1$ is exogenous, so it is like a causal intervention that "wiggles" stuff around. This wiggling has a direct effect on $y_2$, but there is also an indirect one through the first equation.

Suppose a smart student, who has been paying attention to the lessons on simultaneity, writes down a reduced form model for $y_1$ and $y_2$ in terms of $z_1$: $$\begin{align} y_1 =& \frac{\gamma_1}{1-\beta_1} + \frac{\beta_1}{1-\beta_1} z_1 + \frac{u_1}{1-\beta_1} \\ =& E[y_1 \vert z_1] + v_1 \\ y_2 =& \frac{\gamma_1}{1-\beta_1} + \frac{1}{1-\beta_1} z_1 + \frac{u_1}{1-\beta_1} \\ =& E[y_2 \vert z_1] + v_1, \end{align}$$

where $v_1 = \frac{u_1}{1- \beta_1}$. The two coefficients on $z_1$ have a causal interpretation: any external change in $z_1$ will cause the $y$s to change by those amounts. But in the SEM/DGP, the values of the $y$s also respond to $u_1$. In order to separate the two channels, we require $z_1$ and $u_1$ to be independent, so as not to confound the two sources. That is the condition under which the causal effects of $z_1$ are identified. But this is probably not what we care about here.

In the SEM/DGP,

$$\frac{\partial y_1}{\partial y_2} = \beta_1 =\frac{\partial y_1}{\partial z_1} \div \frac{\partial y_2}{\partial z_1} =\frac{ \frac{\beta_1}{1-\beta_1}}{ \frac{1}{1-\beta_1}}.$$

We know that we can recover $\beta_1$ from the two reduced form coefficients (assuming independence of $z_1$ and $u_1$).
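A simulation of this system (my own sketch; $\gamma_1 = 1$ and $\beta_1 = 0.5$ are invented values) makes the simultaneity bias and the indirect-least-squares recovery concrete:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000_000
gamma1, beta1 = 1.0, 0.5

# Generate from the SEM by solving for y1 given (z1, u1), then using the identity.
z1 = rng.normal(size=n)                        # exogenous, independent of u1
u1 = rng.normal(size=n)
y1 = (gamma1 + beta1 * z1 + u1) / (1 - beta1)  # reduced form for y1
y2 = y1 + z1                                   # definitional identity

slope = lambda y, x: np.cov(y, x)[0, 1] / x.var()

# Naive OLS of y1 on y2 does not recover beta1 (simultaneity bias):
print(slope(y1, y2))                  # ~ 0.75, not 0.5

# The ratio of the two reduced-form coefficients on z1 does recover it:
print(slope(y1, z1) / slope(y2, z1))  # ~ 0.5
```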

But what does it mean for $\beta_1$ to be the causal effect of $y_2$ on $y_1$ when they are jointly determined? All the changes come from $z_1$ and $u_1$ (as the reduced form equations make clear), and $y_2$ is only an intermediate cause of $y_1$. So the first structural equation gives us the "snapshot" impact, while the reduced form equations give us the equilibrium impact after allowing the endogenous variables to "settle".

Given a system of linear equations, there are formal conditions for when parameters like $\beta_1$ are recoverable; they can be stated on a DAG or on a system of equations. But this is all to say that whether something is "causal" cannot be recovered from a single linear equation and some assumptions about exogeneity. There is always some model lurking in the background, even if it is not acknowledged as such. That is what it means to get the DGP "right", and it is a crucial ingredient.

dimitriy
  • Thanks Dimitriy. I'm going to ask the same question as for the other answers. You are already assuming the **direction** of causality in the question. Is there a weaker assumption (than assuming the direction of causality itself) which allows you to **rule out**, in a regression, that causality runs the other way? To put it differently, can we not always consistently estimate any properly defined conditional mean? What ensures we got the right direction of causality? – luchonacho Oct 27 '20 at 19:28
  • I am thinking of this as starting from the regression and deriving the weakest set of assumptions that ensures I got causality right, instead of starting from a DGP with causality already defined and going down the road. – luchonacho Oct 27 '20 at 19:30
  • That's what it means to have the DGP right. In any case, reverse causality can be conceptualized as an omitted variable bias. – dimitriy Oct 27 '20 at 19:32
  • I think you are looking for a holy grail here. The "right" DGP is very context specific, and the world is too weird and complicated to have a single approach that can deal with all cases. For example, would the conditions for identification of price elasticity in a demand and supply DGP generalize to the causal effect of advertising or to global warming and racial discrimination? – dimitriy Oct 27 '20 at 19:38
  • You might be right. I am thinking of this from the bottom up. A student comes up with a regression of x on y. I ask: what ensures you got causality? The answer would be: exogeneity. Is it as "simple" as that? (Although you are saying uncorrelatedness, which is weaker than exogeneity.) – luchonacho Oct 27 '20 at 19:50
  • I think I am saying that exogeneity is not always sufficient, and I added a toy example where that is hopefully clearer. – dimitriy Oct 27 '20 at 22:11
  • @Dimitriy V. Masterov; your explanation (the first part) looks like the omitted variables bias story that I encountered in most econometrics books. Actually, OVB is the closest argument about causality that we can encounter in some books. However, unfortunately, this story is not enough. Your explanation still holds (consistency holds) even if the DGP does not have any causal meaning, but in that case no causal conclusions can be reached. – markowitz Oct 28 '20 at 13:10
  • You do not make clear causal assumptions about the DGP and, worse, your choice to talk about "regressors" directly in the DGP puts your story (like several books) in insurmountable ambiguity. – markowitz Oct 28 '20 at 13:10
  • For details about the deficiencies of OVB for causal inference with regressions, read here: https://stats.stackexchange.com/questions/373385/is-a-regression-causal-if-there-are-no-omitted-variables?noredirect=1&lq=1 – markowitz Oct 28 '20 at 13:10
  • @markowitz I think your concerns are addressed in the second part of my answer, where I get into what it means to have the DGP right. – dimitriy Oct 28 '20 at 13:28
  • Indeed, I focused on the first part. The second sounds better; however, I am not convinced that my concerns are properly addressed there, mainly because your second part is presented as a (good) example, no more. It is not presented as a correction/clarification of the first part. Therefore the first part remains, and the ambiguities too (a similar situation to some books). – markowitz Oct 29 '20 at 08:30
  • Moreover, I have some points about the second part too. Some causal concepts are used, but the DGP/SEM is introduced without clear causal meaning. Now, in my opinion it must always carry causal meaning, though some people hold the other position. And causal assumptions must be clearly made, especially if we consider the asker's issues. – markowitz Oct 29 '20 at 08:30
  • Moreover, you said: "But this is all to say that whether something is 'causal' cannot be recovered from a single linear equation and some assumptions about exogeneity". This sentence sounds wrong to me. Even if in many economic situations systems and more complex structures are needed, on theoretical grounds, which is what the asker is interested in, causal inference can use a single structural equation and a related single regression equation. And exogeneity is precisely the crucial condition for identification there. In fact this strategy is the standard in the experimental approach. – markowitz Oct 29 '20 at 08:30
  • I think at this point you are restating things you have said before, so it is unlikely that we will satisfy another with additional discussion or that this will prove illuminating for others. But I would add that even in the case of an RCT and a single equation, you need additional assumptions beyond exogeneity to get causality. For example, we also require SUTVA/No GE effects. – dimitriy Nov 04 '20 at 22:37

Short answer:

There is no explicit way of proving causality. All claims of causality must be logically derived, i.e. through common sense (theory). Imagine having an operator (like correlation) which would return causality or non-causality between variables: you would be able to perfectly identify the sources and relations of anything in the universe (e.g. what/whom an interest rate rise would have an impact on; which chemical would cure cancer, etc.). Clearly, this is idealistic. All conclusions of causality are made through (smart) inferences from observations.


Long answer:

The question of which variables cause another is a philosophical one, in the sense that it must be logically determined. For me, the clearest way to see this is through the two classical examples of a controlled vs. uncontrolled experiment. I will go through these while emphasizing how much is statistics and how much is common sense (logic).

1. Controlled experiment: fertilizer

Assume you have an agricultural field divided into parcels (squares). There are parcels on which crops $(y)$ grow with and without sunlight $(X_1)$, with and without good nutrients $(X_2)$. We wish to see if a certain fertilizer ($X_3$) has an impact or not on the crop yield $y$. Let the DGP be: $y_i = \beta_0+\beta_1 X_{1i}+\beta_2 X_{2i}+\beta_3 X_{3i} +\varepsilon_i$. Here $\varepsilon_i$ represents the inherent randomness of the process, i.e. the randomness that we would have in predicting crop yield, even if this true DGP were known.

Exogeneity: [skip if clear]

The strong exogeneity assumption $E[\varepsilon_i|\textbf{X}]=0$ that you mention is needed in order for the coefficients $\hat\beta$ estimated by OLS to be unbiased (not causal). If $E[\varepsilon_i|\textbf{X}]=c$, where $c$ is any constant, all $\hat{\beta_j}$ except the intercept $\hat{\beta_0}$ are still unbiased. Since we are interested in $\beta_3$, this is sufficient. (Side note: other, weaker assumptions, such as weak exogeneity and orthogonality between $X$ and $\varepsilon$, are sufficient for unbiasedness.) Saying that $E[X|Z]=c$ for any two random variables $X$ and $Z$ means that $X$ does not systematically depend in the mean on $Z$: if I take the mean of $X$ (as the number of observations $\to\infty$) for any value of $Z$, I will get (approximately) the same value each time, so knowing $Z$ does not help at all in predicting the mean of $X$ (e.g. $E[X|Z=10]=E[X|Z=10000]=E[X|Z=-5]=E[X]=c$).

Why is this interesting? Remember, we want to know whether the fertilizer $X_3$ has an impact ($\beta_3=0$?) on the crop yield $y$. By spraying fertilizer on random parcels, we implicitly "force" exogeneity of $X_3$ with respect to all other regressors. How? Well, if we randomly spray fertilizer on a parcel, no matter whether it has sunlight or not and whether it has good nutrients or not, and we then take the mean value of fertilizer for sunny parcels, it will be the same as the mean value for non-sunny parcels. Same with nutrient-rich parcels. The results in the table below hold approximately, for large numbers. It makes sense, after all, that if $X_3$ is independent of $X_1$, its mean should not change (significantly) as $X_1$ changes.

[image: table of mean fertilizer levels $X_3$ across sunny/non-sunny and nutrient-rich/poor parcels]

So, in other words, $X_3$ is exogenous w.r.t. $X_1,X_2$, i.e. $E[X_3|X_1,X_2]=c$. This means that, effectively, if we want to estimate $\beta_3$ unbiasedly, we don't need $X_1,X_2$. Hence these two variables (sun, nutrients) can be treated as randomness and incorporated into the noise term, giving the regression $y_i = \beta_0 + \beta_3 X_{3i} + \epsilon_i$, where $\epsilon_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i$. Hence the noise term can also be interpreted as a collection of all other variables that influence the response $y$, but not in a systematic fashion in the mean. (Note that $\hat\beta_0$ is biased; further note that exogeneity is weaker than independence, since the variables could be related in a higher moment, such as the variance, with exogeneity still holding; see heteroskedasticity.)
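Here is a minimal simulation of this point (my addition, with invented coefficients): the short regression of yield on fertilizer alone is unbiased for $\beta_3$ when fertilizer is randomized, but biased when it is sprayed preferentially on sunny parcels:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
b0, b1, b2, b3 = 1.0, 2.0, 1.5, 0.8      # invented DGP coefficients

x1 = rng.integers(0, 2, n)               # sunlight (0/1)
x2 = rng.integers(0, 2, n)               # nutrients (0/1)

def crop_yield(x3):
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + rng.normal(size=n)

def short_slope(x3, y):
    """OLS slope of y on x3 alone; sun and nutrients live in the noise term."""
    return np.cov(y, x3)[0, 1] / x3.var()

# Randomized fertilizer: independent of sun/nutrients, so E[noise | x3] is
# constant and the short regression still recovers b3.
x3 = rng.integers(0, 2, n)
print(short_slope(x3, crop_yield(x3)))   # ~ 0.8

# Fertilizer sprayed mostly on sunny parcels: exogeneity fails and the short
# regression credits the fertilizer with part of the sun's effect.
x3 = rng.binomial(1, 0.2 + 0.6 * x1)
print(short_slope(x3, crop_yield(x3)))   # ~ 2.0, badly biased
```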

Causality:

Now where does causality come into play? So far we have only shown that randomly distributing fertilizer on better or worse parcels lets us look at crop yield and fertilizer alone, without taking the other variables (sun, nutrients) into account, i.e. "forcing" exogeneity of fertilizer and pushing all other variables into the noise term. Causality itself was not and will not be proven. However, if $\hat\beta_3$ turns out to be significant, we can logically conclude that, since the randomization of fertilizer effectively "de-relates" it from all other variables (in the mean), it must have an impact on crop yield, since all other variables have no systematic impact in this setting.

In other words: 1) we used exogeneity to statistically prove that this is the condition we need for unbiased estimators (for OLS); 2) we used randomization to get this exogeneity and get rid of other uninteresting variables; 3) we logically concluded that, since there is a positive relation, it must be a causal one.

Notice that 3) is just a common sense conclusion, no statistics involved as in 1) or 2). It could theoretically be wrong, since e.g. it could have been that the fertilizer was actually a 'placebo' ($\beta_3=0$) but was distributed only on the sunny and nutrient-rich parcels by pure chance. Then the regression would wrongly show a significant coefficient because the fertilizer would get all the credit from the good parcels, when in fact it does nothing. However, with a large number of parcels this is so unlikely that it is very reasonable to conclude causality.

2. Uncontrolled experiment: wage and education

[I will eventually (?) return with an edit to continue here later; topics to be addressed: OVB, Granger causality and instantaneous causality in VAR processes]


This question is precisely the reason why I started learning statistics/data science: shrinking the real world into a model. Truth/common sense/logic are the essence. Great question.

PaulG
  • You speak about some interesting things and examples, but I have some points. First of all, in both the short and the long answer there is no clear reply to the main question. Second, the conditions declared by the asker are not used, at least not clearly (given a joint distribution's characteristics = DGP: explain the links among DGP/regression/assumptions). – markowitz Nov 03 '20 at 07:29
  • About exogeneity you write: "The strong exogeneity assumption $E[\epsilon |X]=0$ that you mention is needed in order for the coefficients estimated by OLS $\hat{\beta}$ to be unbiased (not causal)" … "we used exogeneity to statistically prove that this is the condition we need for unbiased estimators (for OLS)". These sentences attribute to exogeneity only statistical meaning and no causal one. This position is common but, as I have said several times and in several places, it is wrong. – markowitz Nov 03 '20 at 07:29

Regression is just a series of statistical techniques to strengthen causal inferences between two variables of interest by controlling for alternative causal explanations. Even a perfectly linear relationship ($r^2=1$) is meaningless without first establishing the theoretical basis for causality. The classic example is the correlation between ice cream consumption and pool drownings: neither causes the other; both are caused by summer weather.

The point of experiments is to determine causality, which typically requires establishing that: 1) one thing happened before the other, 2) the putative cause has some explanatory mechanism for affecting the outcome, and 3) there are no competing explanations or alternative causes. It also helps if the relationship is reliable, i.e. the lights go on every time you hit the switch. Experiments are designed to establish these relationships by controlling conditions to establish the chronological sequence and control for possible alternative causes.

Pearl (Pearl, J. (2009). Causality. Cambridge University Press) is a good read, but beyond that lies a (fascinating) philosophical rat-hole regarding causation and explanation.

Mox