
Question: Is there any evidence that collinearity causes some predictors in this model to be insignificant?

Using R, I calculated the correlations of the predictor variables of a linear model.

         lcavol lweight  age  lbph   svi   lcp gleason pgg45
lcavol    1.00    0.19 0.22  0.03  0.54  0.68    0.43  0.43
lweight   0.19    1.00 0.31  0.43  0.11  0.10    0.00  0.05
age       0.22    0.31 1.00  0.35  0.12  0.13    0.27  0.28
lbph      0.03    0.43 0.35  1.00 -0.09 -0.01    0.08  0.08
svi       0.54    0.11 0.12 -0.09  1.00  0.67    0.32  0.46
lcp       0.68    0.10 0.13 -0.01  0.67  1.00    0.51  0.63
gleason   0.43    0.00 0.27  0.08  0.32  0.51    1.00  0.75
pgg45     0.43    0.05 0.28  0.08  0.46  0.63    0.75  1.00
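
For reference, a correlation matrix like the one above can be produced with a call along these lines (the data frame name prostate is an assumption):

round(cor(prostate[, c("lcavol", "lweight", "age", "lbph",
                       "svi", "lcp", "gleason", "pgg45")]), 2)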

I think there are relatively strong correlations between lcp and lcavol, lcp and svi, lcp and pgg45, and between gleason and pgg45.

Would a correlation value of >0.5 be considered strong (i.e., one of the variables would do a good job of representing the other)? How do we determine the minimum benchmark for deciding that two variables are strongly correlated?

air_nomad
  • Individual correlations tell you little unless they are very close to $1$ in absolute value. It is possible for all correlations to be relatively low but for one variable to be *perfectly* collinear with the rest of them. See https://stats.stackexchange.com/a/14528/919 for an analysis of this kind of situation. – whuber Oct 09 '20 at 16:39
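
A minimal sketch of the situation described in the comment above: every pairwise correlation is about 0.5 or less, yet one predictor is an exact linear combination of the others, so lm() cannot estimate its coefficient (the variable names here are illustrative only).

# Low pairwise correlations can hide perfect multicollinearity.
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
x5 <- x1 + x2 + x3 + x4                    # exactly collinear with the other four
round(cor(cbind(x1, x2, x3, x4, x5)), 2)   # no pairwise correlation exceeds ~0.5
y  <- x1 + x5 + rnorm(n)
coef(lm(y ~ x1 + x2 + x3 + x4 + x5))       # x5 comes back NA: perfect collinearity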

1 Answer


A correlation of 0.5 would not be considered high in terms of its effect on standard errors. In my experience, collinearity only tends to become problematic at correlations quite close to 1.

A little simulation can demonstrate this.

> library(MASS)      # for mvrnorm()
> library(magrittr)  # for the %>% pipe (also re-exported by dplyr)
> set.seed(1)
> N <- 20
> rho <- 0.5
> Sigma <- matrix(c(1, rho, rho, 1), 2, 2)   # correlation matrix of the two predictors
> u <- mvrnorm(n = N, c(10, 10), Sigma, empirical = TRUE)   # empirical = TRUE makes the sample correlation exactly rho
> X1 <- u[, 1]
> X2 <- u[, 2]
> Y <- X1 + X2 + rnorm(N)
> lm(Y ~ X1 + X2) %>% summary()

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.3380     2.1116  -1.581    0.132    
X1            1.1514     0.2104   5.472 4.13e-05 ***
X2            1.1963     0.2104   5.686 2.68e-05 ***

Now if we change the correlation to 0.95:

> set.seed(1)
> N <- 20
> rho <- 0.95
> Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
> u <- mvrnorm(n = N, c(10, 10), Sigma, empirical = TRUE)
> X1 <- u[, 1]
> X2 <- u[, 2]
> Y <- X1 + X2 + rnorm(N)
> lm(Y ~ X1 + X2) %>% summary()

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -2.9106     1.8540  -1.570   0.1349  
X1            1.0814     0.5836   1.853   0.0813 .
X2            1.2235     0.5836   2.097   0.0513 .
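
The size of that jump is what the usual variance inflation arithmetic predicts: with two predictors, each standard error is inflated by a factor of $1/\sqrt{1-\rho^2}$ relative to uncorrelated predictors. A quick check of the two values used above:

sqrt(1 / (1 - 0.50^2))   # about 1.15 for rho = 0.50
sqrt(1 / (1 - 0.95^2))   # about 3.20 for rho = 0.95

The ratio of the two factors is about 2.8, which matches the increase in the reported standard errors from roughly 0.21 to 0.58.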
Robert Long
  • Thank you! Does that mean there are no insignificant predictors (since none of the variables have a strong correlation, none of them would do a good job of representing the other)? I believe removing insignificant predictors reduces collinearity, correct? – air_nomad Oct 09 '20 at 16:38
  • Removing *any* predictor reduces multicollinearity. Focusing on individual p-values for model selection (which, when conducted in a principled way, is called *stepwise selection*) often doesn't work as well as many other methods. – whuber Oct 09 '20 at 16:45
  • Unless you have extremely high correlations, you shouldn't have a problem with multicollinearity. Correlations are to be expected with real-world, observational data; only in (well-conducted) controlled experiments might you find no correlations between predictors. If you are building a model, it is best to do so using knowledge of all the variables, if at all possible. As mentioned by @whuber, stepwise procedures are not very good, generally speaking, especially if you are trying to *understand* the data. – Robert Long Oct 09 '20 at 17:00
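
For the model in the question, a more direct check than eyeballing pairwise correlations is to compute variance inflation factors. A minimal sketch, assuming the data are in a data frame called prostate with response lpsa (both names are assumptions; adjust them to your data):

library(car)   # provides vif()
fit <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45,
          data = prostate)
vif(fit)       # rule of thumb: values above roughly 5-10 suggest problematic collinearity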