
X - treatment variable

Y - outcome variable

Z - confounder

DAG:

[DAG: Z → X, Z → Y, X → Y]

Model:

y ~ x + z
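
(As an editorial aside, not part of the original question: the DAG can be encoded with the dagitty R package, which also confirms that adjusting for z, as in the model above, is the correct adjustment set.)

library(dagitty)

# Z confounds the effect of X on Y: Z -> X, Z -> Y, X -> Y
g <- dagitty("dag { Z -> X ; Z -> Y ; X -> Y }")
adjustmentSets(g, exposure = "X", outcome = "Y")
# prints { Z }: adjusting for Z identifies the effect of X on Y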

Question

If x and z are strongly correlated with each other, is the no-multicollinearity assumption violated? And does this model cause the b coefficient of x to be smaller, or close to zero?

How do you handle such situations? The DAG gives a reason to adjust for z, but there is multicollinearity. Does your approach differ if the correlation is moderate or weak?

  • “this model causes the b coefficient of x to be smaller or close to zero?” What makes that a problem for you? What is the goal of your modeling? – Dave Aug 07 '20 at 12:44
  • If Z is a confounder, it will not affect Y directly. Is your DAG incorrect? –  Aug 07 '20 at 13:00
  • @SubhashC.Davar this is a textbook case of confounding; there is nothing wrong with the DAG as far as I can see, so I don't understand your point. A confounder is a cause, or a proxy for a cause, of both the exposure and the outcome. – Robert Long Aug 07 '20 at 13:02
  • If it is a cause of both the exposure and the outcome, how can you think of "multicollinearity"? –  Aug 07 '20 at 13:11
  • @SubhashC.Davar See the simulations in my answer for an example – Robert Long Aug 07 '20 at 13:12

1 Answer


Multicollinearity will only be a problem if the correlation between X and Z is exactly 1, i.e. perfect collinearity. In that case, X and Z can be combined into a single variable, which will provide an unbiased estimate. We can see this with a simple simulation:

> library(magrittr)  # loaded for the %>% pipe used below
> set.seed(1)
> N <- 100
> Z <- rnorm(N)
> X <- Z   # perfect collinearity
> Y <- 4 + X + Z + rnorm(N)
> lm(Y ~ X) %>% summary()

Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8768 -0.6138 -0.1395  0.5394  2.3462 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.96231    0.09699   40.85   <2e-16 ***
X            1.99894    0.10773   18.56   <2e-16 ***

The estimate for X is biased (around 2 rather than the true 1, because X absorbs the effect of Z). But adjusting for Z will not work, due to the perfect collinearity:

> lm(Y ~ X + Z) %>% summary()

Call:
lm(formula = Y ~ X + Z)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8768 -0.6138 -0.1395  0.5394  2.3462 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.96231    0.09699   40.85   <2e-16 ***
X            1.99894    0.10773   18.56   <2e-16 ***
Z                 NA         NA      NA       NA    

So we combine X and Z into a new variable, W, and condition on W only:

> W <- X + Z
> lm(Y ~ W) %>% summary()

Call:
lm(formula = Y ~ W)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8768 -0.6138 -0.1395  0.5394  2.3462 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.96231    0.09699   40.85   <2e-16 ***
W            0.99947    0.05386   18.56   <2e-16 ***

and we obtain an unbiased estimate. This works because X and Z have equal coefficients in the data-generating process, so Y = 4 + (X + Z) + noise = 4 + W + noise, and the coefficient on W is exactly that common effect.

Regarding your point:

does this model cause the b coefficient of x to be smaller, or close to zero?

No, that should not be the case. If the correlation is high the estimate may lose some precision, but it should still be unbiased. Again, we can see this with a simulation:

> nsim <- 1000
> vec.X <- numeric(nsim)
> vec.cor <- numeric(nsim)
> #
> set.seed(1)
> for (i in 1:nsim) { 
+ 
+   Z <- rnorm(N)
+   X <- Z + rnorm(N, 0, 0.3) # high collinearity
+   vec.cor[i] <- cor(X, Z)
+   Y <- 4 + X + Z + rnorm(N)
+   m0 <- lm(Y ~ X + Z)
+   vec.X[i] <- coef(m0)[2]
+   
+ }
> mean(vec.X)
[1] 1.00914
> mean(vec.cor)
[1] 0.9577407
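
To see the precision loss concretely (an addition to the original answer, reusing N and nsim from above), we can compare the spread of these estimates with a baseline in which X and Z are independent:

sd(vec.X)                # sampling SD of the X coefficient under high collinearity

vec.X0 <- numeric(nsim)
set.seed(1)
for (i in 1:nsim) {
  Z <- rnorm(N)
  X <- rnorm(N)          # X drawn independently of Z: no collinearity
  Y <- 4 + X + Z + rnorm(N)
  vec.X0[i] <- coef(lm(Y ~ X + Z))[2]
}
mean(vec.X0)             # still centred on 1 (unbiased)
sd(vec.X0)               # noticeably smaller spread than sd(vec.X)

With two predictors the variance of the X coefficient scales with 1/(1 - cor(X, Z)^2), so a correlation of about 0.96 inflates the sampling variance roughly twelvefold relative to the independent case.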

Note that in the first example above we knew the data-generating process, and because we knew that X and Z had equal influence, a simple sum of the two variables worked. In practice, however, we won't know the data-generating process. So if we do have perfect collinearity (not likely in practice, of course), we could use the same approach as in the second simulation above and add some small random error to Z, which will recover an unbiased estimate for X.

Does your approach differ if the correlation is moderate or weak?

If the correlation is moderate or weak, there should be no problem in conditioning on Z.
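
(A final editorial sketch, not in the original answer: if you want to quantify how much a given level of correlation is costing you in precision, the variance inflation factor is the standard diagnostic; car::vif() computes it for a fitted model.)

library(car)   # provides vif()

m <- lm(Y ~ X + Z)
vif(m)         # with two predictors, VIF = 1/(1 - cor(X, Z)^2); values near 1
               # are harmless, large values mean wide standard errors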

  • Thank you so much Robert! Correlation was moderate. Thus, I'll proceed with the analysis! – st4co4 Aug 07 '20 at 14:30
  • Does your suggestion allow you to separately identify the effects of x and z if they were not the same? – dimitriy Aug 08 '20 at 14:46
  • @DimitriyV.Masterov I would never want to know the association of `Z` with `Y` if `X` is the exposure because `Z` is a confounder. If `Z` was the exposure then `X` would be a mediator so we would not want to include `X` in the model at all. – Robert Long Aug 08 '20 at 14:53
  • Let me be clearer. We care about the unbiased effect of X on Y. If `Y` […] – dimitriy Aug 08 '20 at 15:04
  • @DimitriyV.Masterov Good point! Actually I had not thought of that until now ;) In that case I would induce imperfect correlation by adding some very small random error to `Z`, as in my 2nd code block above, which would still recover an unbiased estimate for `X`. – Robert Long Aug 08 '20 at 15:22