2

A causal chain lists event (or fact) $y$ with all its causal antecedents.

We make a model of the following form:

$$ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon $$

$\hat\beta_1$ has a p-value $< 0.001$, and the true $\beta_1 \ne 0$ because the true $x_1$ is perfectly correlated with the true $y$.

In a deterministic paradigm, is it possible for two true population variables to be perfectly correlated but share no causal chain or causal ancestor, i.e.,

  • $x_1 \not\to ... y$ AND
  • $y \not\to ... x_1$ AND
  • $Z \not\to ... x_1, Z \not\to ... y$?
jtd
  • 579
  • 4
  • 11

3 Answers

2

I think perhaps you're mixing up correlation and linear regression. They are similar, but they answer somewhat different questions. GraphPad has a nice description here, for further reading. @Juan offers a nice example, but here's something to build on it.

Correlation measures two variables' tendency to rise and fall together (in tandem for positive $r$, in opposite directions for negative $r$). This can suggest some causal link, but the correlation coefficient alone cannot offer concrete evidence for one (or against one). An amusing website I found while doing some quick Googling was this. Beware thy cheese consumption.

The $P$-value you reference (presumably for $\hat\beta_1$, not $x_1$) refers to the regression coefficient in the multiple regression. Regression addresses the ability of $x_i$ to predict $y$. The $P < 0.001$ means that an estimate as extreme as $\hat\beta_1$ would be very unlikely if the true $\beta_1$ were zero, and thus that $x_1$ has some predictive power for $y$ under the specified model. Even then, though, this doesn't answer the question of causation; it only describes the expected change in $y$ per unit change in $x_1$.

Providing evidence for causation comes from experimental design, not from a statistical relationship alone. As the statistician Paul Holland put it, there is "no causation without manipulation." If your experimental design has no basis to demonstrate causation, nor solid theory to imply it, then a low $P$-value can't add evidence to the presumption of causation. It can spur further experiments to find a causal chain, however!
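To see why a tiny $P$-value alone cannot establish causation, here is a hypothetical simulation (not from the question; variable names and noise levels are my own assumptions): an unobserved variable $z$ drives both $x_1$ and $y$, $x_1$ plays no causal role at all, yet the regression of $y$ on $x_1$ still recovers a large, highly "significant" slope.

```python
# A confounder z drives both x1 and y; x1 does not cause y, yet the
# regression slope of y on x1 is large and its correlation is near 1.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)              # unobserved common cause
x1 = z + 0.1 * rng.normal(size=n)   # z -> x1
y = z + 0.1 * rng.normal(size=n)    # z -> y  (x1 plays no causal role)

# Least-squares slope of y on x1 (with intercept)
X = np.column_stack([np.ones(n), x1])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r = np.corrcoef(x1, y)[0, 1]
print(f"slope ~ {beta[1]:.2f}, correlation ~ {r:.2f}")  # both near 1
```

The regression is behaving exactly as designed: $x_1$ really does predict $y$. It simply says nothing about which of the possible causal diagrams produced that predictive power.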

EDIT: For a more technical read, see here. See Section 2 for the distinction between association and causation. Regression is listed directly as an associational concept, not a causal one. In brief, the relevant distinction is summed up in the paper: "An associational concept is any relationship that can be defined in terms of a joint distribution of observed variables, and a causal concept is any relationship that cannot be defined from the distribution alone."

Ashe
  • 1,085
  • 10
  • 25
  • Thanks! However, it does not matter to me if (1) $x_1$ causes $y$, (2) $y$ causes $x_1$, or (3) $z$ causes both $y$ and $x_1$, so why do I need experimental design to tell me which of these three is correct? I want to say, if we are correct that $\beta_1 \ne0$, then one of these three options MUST be true. Can I? – jtd Mar 11 '15 at 14:27
  • @jtd I don't think those three options are your only ones. What about $z$ causing $y$ and $w$ causing $x$? Or $u$ causing both $z$ and $w$, which cause $y$ and $x$ in turn? Or infinitely many other possibilities, of which no causation at all is one. It is only with changing conditions (experimental design) that causation can be affirmed. I'll edit the above with a more technical paper, but in essence it draws a distinction between association and causation. Association is drawn from distributions, causation from changing conditions. – Ashe Mar 11 '15 at 14:37
  • Tell me if this is correct, please: <0.001 includes likelihood of $z \to y, w \to x$ and all other scenarios where $x_1$ and $y$ do not share a common causal antecedent. Note that $u \to z \to y, u \to w \to x$ does show a common antecedent (is a (3)), however. If we are correct that $\beta_1 \ne 0$, what other relation can exist between $x_1$ and $y$ other than (1), (2), or (3)? – jtd Mar 11 '15 at 14:59
  • The $P < 0.001$ is a measure of the probability that the parameter estimate deviates from the null hypothesis, given an assumed distribution of parameter estimates consistent with linear regression. This is a relationship that is fully defined by distributions alone, making it an associational concept only. The (4) in your list is that no causation exists. To establish causation, you need concepts like intervention, randomization, stability, etc., not just a measure of deviation according to an assumed distribution. These are experimental design concepts. I would highly recommend the UCLA paper. – Ashe Mar 11 '15 at 15:09
  • Thanks, I have to read the paper. To simplify, would you agree: $\beta_1 = 0 \to (4); \beta_1 \ne 0 \to (1, 2, \text{or } 3)$? Experiments are the next step after this first decision. Or do you say $\beta_1 \ne 0 \to (1,2,3,...)$? If $\beta_1 \ne 0$ is true, is (4) still in this set for you? If so, what are the other options besides (4)? – jtd Mar 11 '15 at 15:23
  • I would say that without experimental design considerations, one cannot say anything about causation, positive or negative. $\beta_1 = 0 \to$ anything, and $\beta_1 \ne 0 \to$ anything. Your regression _implies_ some causation, and further experiments will elucidate that. One way to elucidate causation would be to perform an intervention on two randomized cohorts of samples and test the difference in parameter estimates. To me, (4) remains on the table until design shows otherwise. – Ashe Mar 11 '15 at 15:40
1

Correlation is a tricky concept, since it denotes only that two variables show dependence in a purely statistical sense. Correlation does not care about the nature of the variables: you can invent $X = \{1,2,3,4,5\}$ and $y = \{2,4,6,8,10\}$ (which I just made up); they share no ancestors, yet their correlation coefficient is a perfect 1. What I am trying to say is: never trust the correlation coefficient alone to judge the relationship between variables.
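The arithmetic behind this answer's example is easy to check; a minimal sketch:

```python
# Pearson correlation of the two made-up lists from the answer.
# Since y is exactly 2 * X, the correlation is a perfect 1.
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
r = np.corrcoef(X, y)[0, 1]
print(r)  # 1.0 (up to floating-point rounding)
```

The coefficient is 1 for any exact linear relationship with positive slope, no matter where the numbers came from.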

Juan Zamora
  • 209
  • 2
  • 8
  • but do your data and data collection follow the assumptions needed for valid multiple regression (e.g., normality and independence)? – jtd Mar 11 '15 at 13:15
  • @jtd Correlation and regression are different methods. For correlation, the only assumption is the interval nature of the variables. See [here](http://stats.stackexchange.com/questions/48450/assumptions-of-correlation-coefficient) – Ashe Mar 11 '15 at 13:26
  • The problem with your "example", is that your two lists of numbers are not "variables" in any statistical sense, they are just made-up lists of numbers, without any connection to anything in the real world. Therefore, they do not qualify as statistical variables. – kjetil b halvorsen Mar 11 '15 at 14:52
  • That's why it is a made-up example. However, what I am trying to say is that it does not matter whether the response is house pricing and the variable $X$ is square footage from a random sample of $n$ houses. The correlation or covariance coefficient does not prove the relationship is valid in more than a mathematical sense, no different from my initial $x$, $y$ variables. – Juan Zamora Mar 11 '15 at 14:57
  • Correlation implies the existence of some kind of underlying causal relationships. – Neil G Mar 17 '15 at 20:55
0

Yes, they can be $d$-connected through observed effects. Since $x_2$ and $x_3$ are also observed, consider for example:

$x_1 \rightarrow x_2 \leftarrow z \rightarrow x_3 \leftarrow y$

means that $x_1$ and $y$ can be correlated given that $x_2$ and $x_3$ are observed, despite neither being the cause of the other, and despite them sharing no common cause and no common descendant.
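This collider structure can be simulated; the noise levels below are illustrative assumptions, not part of the answer. Marginally, $x_1$ and $y$ are independent by construction, but once the colliders $x_2$ and $x_3$ enter the regression, a strong association between $x_1$ and $y$ appears.

```python
# Simulate the chain x1 -> x2 <- z -> x3 <- y. Regressing y on x1 alone
# gives a slope near zero; adding the colliders x2 and x3 opens the path
# and the coefficient on x1 becomes clearly nonzero.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
y = rng.normal(size=n)                   # independent of x1 by construction
z = rng.normal(size=n)
x2 = x1 + z + 0.1 * rng.normal(size=n)   # collider: x1 -> x2 <- z
x3 = z + y + 0.1 * rng.normal(size=n)    # collider: z -> x3 <- y

def slope_on_x1(columns):
    """OLS coefficient on x1 when regressing y on the given columns."""
    X = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

b_alone = slope_on_x1([x1])
b_adjusted = slope_on_x1([x1, x2, x3])
print(b_alone)     # ~0: x1 alone does not predict y
print(b_adjusted)  # clearly nonzero once the colliders are conditioned on
```

Intuitively: holding $x_2$ fixed, a higher $x_1$ implies a lower $z$; holding $x_3$ fixed, a lower $z$ implies a higher $y$. Conditioning on both colliders therefore induces a positive association between $x_1$ and $y$.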

For more information, read about $d$-separation.

Neil G
  • 13,633
  • 3
  • 41
  • 84
  • I apologize for my confusion and I'm still digesting Pearl's 2009 paper on SCM diagrams, but here is my question: If the *True* $x_1$ and *True* $y$ are perfectly correlated, does that perfect *True* correlation exist whether we observe $x_2$ or not? I assume "yes" as $Cor(x_{1.TRUE}, y_{TRUE}) = 1$ has no other terms but $x_1$ and $y$. But in a deterministic paradigm it naively seems that *True* correlation among [randomly distributed] population variables $x_1$ and $y$ should not occur without one of the three causal diagrams in my question upsetting that randomness (certainly not 1.00). – jtd Mar 17 '15 at 21:32
  • The correlation does depend on observation of $x_2$ for some causal diagrams, such as the one I illustrated. Your model takes $x_2$ into account, so you are assuming it is observed. – Neil G Mar 17 '15 at 22:26
  • @jtd: more importantly, if you have a correlation that is independent of any other observations then one of the three cases you described is the case. – Neil G Mar 17 '15 at 22:53