4

Let's say I have 4 or 5 variables of interest which I measured for each of about 500 people. So, imagine a 500x5 matrix. Because of the nature of these variables, it is obvious that there must be (and indeed there is) a correlation between each possible pair of these variables, i.e. columns of the matrix. Is there any way of testing the causality? I used to think no, but then I heard something about Structural Equation Modelling, which to me sounded like a claim that there may be a way. But I really don't have a clue about this topic, and I don't know how and where to begin.

To make my question more concrete, I have a hypothesis that variables A and B determine variable C to a large degree, and variable D is largely determined by variable C, and this is why they are all correlated (because A depends on B as well). An alternative hypothesis might be that variables A, B, and C all determine variable D directly and that is why they are all correlated (because A and C depend on B as well). Obviously, none of the correlations are perfect, and there's always some statistical noise in each relationship. For clarity, I made a sketch of the two competing hypotheses (models) in my example.

Is there a pragmatic way (SEM or other) to show whether the data favours one of these models over the other? If yes, what's the best or most pragmatic way to implement this procedure, e.g. in R?

enter image description here

guesttt
  • 41
  • 2
  • Excellent question! So glad you included a causal diagram - way too few causal questions on this site include a diagram. One fairly straightforward way you could determine if the model on the right is viable is to condition on $C,$ and see if causal information can still flow from $A$ and $B$ to $D.$ If you condition on $C,$ and the model on the right is correct, then $A$ and $B$ should not be able to causally influence $D.$ – Adrian Keister Jul 06 '20 at 14:48

1 Answers1

4

It depends what your research question is. I appreciate that you have a causal theory in mind, but it is always good to keep in mind that generally, at least with regression you need to decide what is the main exposure - and you will be typically wanting to estimate the total causal effect for that variable on the outcome. If you want to look at the whole picture and see all the direct and indirect estimates then a structural equation model (path diagram in this case) would be a better approach. However, regression is usually the approach most people choose:

In the model on the left, if $B$ is the main exposure then $A$ and $C$ are mediators and should not be conditioned on. However if $C$ is the main exposure then $B$ is a confounder of the path $C\rightarrow D$ and should be conditioned on. $A$ is a decendent of $B$ so this can be treated as a competing exposure and will increase the precision of the estimate for $B$. Similar logic applies if $A$ is the main exposure.

In the model on the right, if $C$ is the main exposure then you don't condition on either $A$ or $B$ because they are upstream. If A is the main exposure then $B$ confounds the $A\rightarrow C$ path (and hence $A \rightarrow C \rightarrow D$ and) should be conditioned on. If $B$ is the main exposure then $A$ is a mediator and should not be conditioned on.

Take a look here for more on this.

So as to the main question:

Is there a pragmatic way (SEM or other) to show whether the data favours one of these models over the other? If yes, what's the best or most pragmatic way to implement this procedure, e.g. in R?

So bearing the above in mind, yes, you can consider C to be the main exposure. Let's do a simple simulation. First for the model on the left:

rm(list=ls())
N <- 100
B <- rnorm(N)
A <- B + rnorm(N)
C <- B + rnorm(N)
D <- A + B + C + rnorm(N)

as per the discussion above, in this case we condition on B and A:

summary(lm(D ~ C + B + A))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.1293     0.1210  -1.069    0.288    
C             0.9208     0.1252   7.355 6.41e-11 ***
B             0.9741     0.2163   4.503 1.88e-05 ***
A             1.0682     0.1181   9.048 1.65e-14 ***

And we get signficant contributions from A and B, as expected.

Now, with the model on the right, A and B still have the same causal structure, so we just need to simular a new C and D:

C1 <- B + A + rnorm(N)
D1 <- C1 + rnorm(N)

Now, if we just wanted the causal effect of C1 on D1 we would not control for A and B, bue here we want to compare the same model we used above:

summary(lm(D1 ~ C1 + A + B))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.031499   0.107806   0.292   0.7708    
C1          0.851798   0.104286   8.168 1.25e-12 ***
A           0.005409   0.166517   0.032   0.9742    
B           0.291614   0.169539   1.720   0.0886 .  

...and now we find the contibutions from A and B drop out.

With an SEM approach:

library(lavaan)

dt1 <- data.frame(A,B,C,D)

m1.true <- '
  D ~ A + B + C
  C ~ A
  A ~ B'

m1.true.res <- sem(m1, dt1)
summary(m1.true.res)

we obtain:

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  D ~                                                 
    A                 1.068    0.116    9.234    0.000
    B                 0.974    0.212    4.596    0.000
    C                 0.921    0.123    7.507    0.000
  C ~                                                 
    A                -0.080    0.094   -0.855    0.393
    B                 0.996    0.141    7.057    0.000

and all of this is consistent with the first model. But if we try to fit the second model we get:

m1.false <- '
  D ~ C
  C ~ A + B
  A ~ B'

m1.false.res <- sem(m1.false, dt1)
summary(m1.false.res)


Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  D ~                                                 
    C                 2.057    0.173   11.880    0.000
  C ~                                                 
    A                -0.080    0.094   -0.855    0.393
    B                 0.996    0.141    7.057    0.000
  A ~                                                 
    B                 1.139    0.098   11.622    0.000

and from this we can see the estimate for C ~ A is small and non significant which is not consistent with the 2nd model. We also see that the estimate for D ~ C is larger then we expected - but we would not know this if we hadn't simulated the data.

Robert Long
  • 53,316
  • 10
  • 84
  • 148