
I know the famous expression "correlation does not imply causation". In a DAG, this situation might look like

$$ X \leftarrow U \rightarrow Y $$

Here even though $X$ and $Y$ are not causally related, the presence of confounder $U$ induces a correlation between them.
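A small simulation can make this concrete. This is only a sketch under assumed linear-Gaussian mechanisms (the coefficients, noise scales, and the `pearson` helper are illustrative choices, not part of the question): $U$ drives both $X$ and $Y$, which are marginally correlated, and the correlation vanishes once $U$'s contribution is removed.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)
n = 20_000

# Assumed structural equations for the fork X <- U -> Y (illustrative only).
u = [rng.gauss(0, 1) for _ in range(n)]
x = [ui + rng.gauss(0, 1) for ui in u]
y = [ui + rng.gauss(0, 1) for ui in u]

# Marginal association induced purely by the confounder U.
r_marginal = pearson(x, y)

# Remove U's contribution. The coefficient 1.0 is known here by construction;
# with real data it would have to be estimated (e.g. by regression on U).
x_res = [xi - ui for xi, ui in zip(x, u)]
y_res = [yi - ui for yi, ui in zip(y, u)]
r_given_u = pearson(x_res, y_res)

print(f"cor(X, Y)     = {r_marginal:.3f}")
print(f"cor(X, Y | U) = {r_given_u:.3f}")
```

Under these assumptions the population value of the marginal correlation is $0.5$, while the residual correlation should be close to zero up to sampling noise.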

I also know that two variables that are causally related can be uncorrelated, as correlation is a linear measure of association. For example, the correlation between $X$ and $Y$ with $Y = X^2$ is $0$.

In the context of counterfactual formal causal reasoning, my question is: if there is no unblocked path between $X$ and $Y$, would we ever expect a non-zero correlation between $X$ and $Y$ in the infinite-sample limit? I know that in finite samples spurious correlations can appear simply due to chance. Asymptotically, though, if there is no open causal path between two variables, can we expect a non-zero correlation (or, really, a non-zero value of any measure of association) between them?

In short: can two d-separated variables have an expected non-zero correlation? Answers should use counterfactual causal reasoning formalisms.

Alexis
Mir Henglin
  • Would 30,000 examples be a good start? https://www.tylervigen.com/spurious-correlations – whuber Jan 03 '22 at 17:33
  • Although the duplicate is a different question, it is so closely related that answers to your question appear there, too. – whuber Jan 03 '22 at 17:40
  • Welcome to CV Mir Henglin! I hope you will forgive my editing in a link to the origin of "correlation does not imply causation". :) – Alexis Jan 03 '22 at 17:49
  • @whuber Those spurious correlations are ones that I believe have arisen in a finite sample, or have arisen due to confounding. I am asking specifically about the situation where this is not the case – Mir Henglin Jan 03 '22 at 18:12
  • @Alexis I am not asking about under what conditions could an observed correlation be considered a measure of causation, I am asking specifically in the case where there is no confounding (no unblocked path), and no conditioning, could we ever expect in the infinite sample limit a correlation between two variables? – Mir Henglin Jan 03 '22 at 18:14
  • Not sure why you are addressing that comment to me (I did not vote to close your question). Does the linked Q&A in the close votes not answer this for you? (I think there's a good case to reopen your question.) – Alexis Jan 03 '22 at 18:52
  • All the examples can be made as long as possible, creating the possibility of arbitrarily large samples. The whole point is that literally *any* two processes, even when completely independent of each other (causally and probabilistically), that undergo similar *deterministic* changes over time, will have non-zero correlations. If that's what you mean by "confounding," then so be it--but there doesn't seem to be a new question involved. – whuber Jan 03 '22 at 18:58
  • @whuber I think this is specifically the question "Does lack of correlation between *d*-separated variables imply lack of causation?" And the answers on the linked question in the VTC do not explicitly address this. (I also think there is a subtler critique to make about counterfactual formal causal inference in dynamic models and stochastic processes, but that's not what Mir Henglin is asking, I think.) – Alexis Jan 03 '22 at 19:01
  • @Alexis As far as I can tell, the posts at https://stats.stackexchange.com/a/538/919, https://stats.stackexchange.com/a/496642/919, https://stats.stackexchange.com/a/296417/919 all provide answers. Besides, it's scarcely credible that we don't already have loads of posts about the relationships (or lack thereof) between correlation and causation. If you think one particular thread doesn't fully answer this question, you should have no problem finding another. – whuber Jan 03 '22 at 19:07
  • What Mir may be ultimately talking about with "no unblocked path" is conditional independence. This then implies a conditional correlation of 0 (if a correlation exists). – Matthew Gunn Jan 03 '22 at 19:07
  • If it helps to rephrase my question, I believe it would be: in expectation, can d-separated variables be correlated? Does that clear things up? – Mir Henglin Jan 03 '22 at 19:07
  • @whuber I agree that they all provide general answers about correlation and causation. They do not provide answers specifically in the context of counterfactual formal causal reasoning, viz. *d*-separation in a causal graph, which I think is worth answering. – Alexis Jan 03 '22 at 19:12
  • Looking for correlation without plausible causation in a large database for a classroom example, I found that rainfall among US states is strongly negatively correlated with high school graduation rate. [Rainfall is heavy in a few southern states with poor HS grad rate; rainfall is low in Utah and a few other mountain west states where HS grad rate is high.] Does wet weather rot HS students' brains? // Not totally clear what an 'open causal path' is. – BruceET Jan 03 '22 at 19:14
  • @Alexis Fair enough: I look forward to reading good answers of this nature! – whuber Jan 03 '22 at 19:14
  • Thank you @Alexis & whuber for helping to make my question clearer! – Mir Henglin Jan 03 '22 at 19:17

2 Answers


No.

One caveat first: the direct causal relationships embedded in a DAG are beliefs (or at least presuppositions of belief), so any counterfactual formal causal analysis is predicated on the DAG being true. With that in mind, your question gets at the utility of this kind of reasoning, because in this worldview correlations can only be interpreted causally given the d-separation of the path from one variable to another. If a set of variables (say, $L$) is sufficient to d-separate the path from $A$ to $Y$ (say, $Y$ as putative effect, and $A$ as putative cause of $Y$), then:

  • one infers $\text{cor}(Y,A|L) \ne 0$ as evidence that $A$ causes $Y$ (this is nonstandard notation: the folks I am familiar with would more typically write something like $P(Y=1|A=0,L) - P(Y=1|A=1,L) \ne 0$ for levels of $L$ instead of speaking specifically of correlation, likely because DAGs and the inferences drawn from them are nonparametric, whereas Pearson's correlation is linear and Spearman's is monotonic), and
  • one infers $\text{cor}(Y,A|L) = 0$ as evidence that $A$ does not cause $Y$.

That is the point of this kind of causal analysis. (And is also why it offers value by directing critique of an analysis specifically to the construction of $L$ and the DAG.)
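To illustrate the stratified contrast $P(Y=1|A=0,L) - P(Y=1|A=1,L)$, here is a minimal sketch with made-up probabilities (all numbers and the `risk` helper are assumptions for illustration). In this toy world $L$ causes both $A$ and $Y$, and $A$ has no effect on $Y$, so the crude contrast is non-zero while the contrast within each level of $L$ is approximately zero.

```python
import random

rng = random.Random(1)
n = 200_000

# Assumed toy mechanism: L -> A and L -> Y, with no arrow from A to Y.
rows = []
for _ in range(n):
    l = rng.random() < 0.5
    a = rng.random() < (0.8 if l else 0.2)   # A depends on L
    y = rng.random() < (0.7 if l else 0.3)   # Y depends on L only
    rows.append((l, a, y))

def risk(data, a_val, l_val=None):
    """Estimate P(Y=1 | A=a_val) or, if l_val is given, P(Y=1 | A=a_val, L=l_val)."""
    sel = [y for l, a, y in data if a == a_val and (l_val is None or l == l_val)]
    return sum(sel) / len(sel)

# Crude contrast, ignoring L: biased by the open backdoor path through L.
crude_rd = risk(rows, False) - risk(rows, True)

# Contrast within levels of L: the backdoor path is blocked.
stratified_rd = {l: risk(rows, False, l) - risk(rows, True, l) for l in (False, True)}

print(f"crude:      {crude_rd:+.3f}")
for l, rd in stratified_rd.items():
    print(f"L={int(l)}:        {rd:+.3f}")
```

With these assumed probabilities the crude contrast has population value $0.38 - 0.62 = -0.24$, even though $A$ does nothing to $Y$.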

Except, kinda yes (but still no).

Back to the caveat about DAGs embodying beliefs. Those beliefs may be more or less valid for any given analysis. In fact, the DAG you provide indicates a good reason why: most variables we might imagine (whether fitting into $L$, $Y$, or $A$ in my nomenclature above) are themselves caused by some other variable, likely a variable in the set of unmeasured prior causes $U$. This is why the validity of causal inferences from observational studies is always subject to threats from unmeasured backdoor confounding (i.e. this quality is part of what we mean by 'observational study'), and why randomized controlled trials have a special kind of value (even though causal inferences from randomized controlled trials are just as subject to threats from selection bias as observational study designs).

Many great examples of correlations existing between 'causally unrelated' variables and processes are provided in links in comments to Mir Henglin's question. I would argue that rather than falsifying my unqualified "No." at the start of my answer, these indicate merely that the DAG has not actually been expanded to cover all the causal variables at play: the set of causal beliefs is incomplete (for example, see Pearl's point about incorporating hidden variables into the DAG). @whuber also made an important comment along these lines:

The whole point is that literally any two processes, even when completely independent of each other (causally and probabilistically), that undergo similar deterministic changes over time, will have non-zero correlations. If that's what you mean by "confounding," then so be it—but there doesn't seem to be a new question involved.
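The effect described in that comment is easy to reproduce. The sketch below uses assumed linear trends and independent Gaussian noise (the slopes, series length, and `pearson` helper are illustrative choices): the two series share nothing but a deterministic dependence on time, yet they are strongly correlated until the trend is removed.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2)
T = 5_000
slope_a, slope_b = 0.01, 0.02   # assumed deterministic trends

# Two processes with independent noise that both drift upward over time.
a = [slope_a * t + rng.gauss(0, 1) for t in range(T)]
b = [slope_b * t + rng.gauss(0, 1) for t in range(T)]
r_raw = pearson(a, b)

# Subtract the trends (known here by construction; estimated in practice).
a_detrended = [ai - slope_a * t for t, ai in enumerate(a)]
b_detrended = [bi - slope_b * t for t, bi in enumerate(b)]
r_detrended = pearson(a_detrended, b_detrended)

print(f"raw correlation:       {r_raw:.3f}")
print(f"detrended correlation: {r_detrended:.3f}")
```

Whether one reads time here as a confounder in the DAG sense is exactly the interpretive question discussed next.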

There are competing interpretations about the appropriateness of time as a causal variable in counterfactual formal causal reasoning. I will point out that:

  • DAG formalisms are explicit only about the qualitative temporal ordering of variables but
  • DAGs are otherwise silent about quantitative lengths of time.

So there is a case to be made that lengths of time can serve as a confounding variable in counterfactual formal causal reasoning.

The upshot is to repeat my opening caveat: conditional on the DAG being true, if a path from $A$ to $Y$ is d-separated and $\text{cor}(Y,A|L) = 0$, then $A$ cannot cause $Y$.

Alexis

In short: can two d-separated variables have an expected non-zero correlation?

No, it is not possible.

More precisely: d-separation guarantees that, in a DAG $G$, if two variables $X$ and $Y$ are d-separated by a set of variables $Z$, then $X$ and $Y$ are independent conditional on $Z$. Note that $Z$ can be the empty set too. Now, you speak about "correlation" and not "conditional correlation", but you also speak about d-separation. From that I suppose that your two variables are d-separated with $Z = \emptyset$. In that case neither correlation nor any other kind of statistical association can appear in the population.
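As a hedged illustration of d-separation given the empty set, consider a collider $X \rightarrow C \leftarrow Y$. This DAG is not in the question; it, the Gaussian mechanisms, and the `pearson` helper are assumptions chosen for the sketch. $X$ and $Y$ are d-separated when nothing is conditioned on, so their correlation is zero in the population, whereas selecting on the collider $C$ opens the path and induces an association.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(3)
n = 50_000

# Assumed mechanism for the collider X -> C <- Y (illustrative only).
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
c = [xi + yi + rng.gauss(0, 0.5) for xi, yi in zip(x, y)]

# Z = empty set: X and Y are d-separated, hence independent.
r_marginal = pearson(x, y)

# Conditioning (here: selecting) on the collider opens the path.
keep = [i for i in range(n) if c[i] > 1.0]
r_selected = pearson([x[i] for i in keep], [y[i] for i in keep])

print(f"cor(X, Y)         = {r_marginal:.3f}")
print(f"cor(X, Y | C > 1) = {r_selected:.3f}")
```

The marginal estimate differs from zero only by sampling noise, which shrinks as `n` grows; the selected-sample correlation does not.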

For example in your DAG

$$ X \leftarrow U \rightarrow Y $$

$X$ and $Y$ are d-separated given $U$, but not given the empty set, so they can be marginally correlated.

Moreover you write

For example, the correlation between $X$ and $Y$ with $Y = X^2$ is $0$

I can guess the idea you have in mind, but this statement is not true in general. Indeed, if $X$ is uniformly distributed on $[0,1]$, this correlation is $> 0$.
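A quick check of both cases, as a sketch (the sample size, seed, and `pearson` helper are arbitrary choices): when $X$ is symmetric about zero the correlation with $X^2$ vanishes, whereas for $X$ uniform on $[0,1]$ the moments $E[X^k] = 1/(k+1)$ give $\text{cor}(X, X^2) = \frac{1/12}{\sqrt{(1/12)(4/45)}} = \sqrt{15}/4 \approx 0.968$.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(4)
n = 50_000

# X symmetric about zero: Cov(X, X^2) = E[X^3] = 0.
x_sym = [rng.uniform(-1, 1) for _ in range(n)]
r_sym = pearson(x_sym, [v * v for v in x_sym])

# X uniform on [0, 1]: X^2 is increasing in X, so the correlation is positive.
x_pos = [rng.uniform(0, 1) for _ in range(n)]
r_pos = pearson(x_pos, [v * v for v in x_pos])

print(f"X ~ U[-1, 1]: cor(X, X^2) = {r_sym:.3f}")
print(f"X ~ U[0, 1]:  cor(X, X^2) = {r_pos:.3f}")
```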

markowitz