When does LASSO select correlated predictors?

Question

I'm using the package 'lars' in R with the following code:

> library(lars)
> set.seed(3)
> n <- 1000
> x1 <- rnorm(n)
> x2 <- x1+rnorm(n)*0.5
> x3 <- rnorm(n)
> x4 <- rnorm(n)
> x5 <- rexp(n)
> y <- 5*x1 + 4*x2 + 2*x3 + 7*x4 + rnorm(n)
> x <- cbind(x1,x2,x3,x4,x5)
> cor(cbind(y,x))
            y          x1           x2           x3          x4          x5
y  1.00000000  0.74678534  0.743536093  0.210757777  0.59218321  0.03943133
x1 0.74678534  1.00000000  0.892113559  0.015302566 -0.03040464  0.04952222
x2 0.74353609  0.89211356  1.000000000 -0.003146131 -0.02172854  0.05703270
x3 0.21075778  0.01530257 -0.003146131  1.000000000  0.05437726  0.01449142
x4 0.59218321 -0.03040464 -0.021728535  0.054377256  1.00000000 -0.02166716
x5 0.03943133  0.04952222  0.057032700  0.014491422 -0.02166716  1.00000000
> m <- lars(x,y,"step",trace=T)
Forward Stepwise sequence
Computing X'X .....
LARS Step 1 :    Variable 1     added
LARS Step 2 :    Variable 4     added
LARS Step 3 :    Variable 3     added
LARS Step 4 :    Variable 2     added
LARS Step 5 :    Variable 5     added
Computing residuals, RSS etc .....

I've got a dataset with 5 continuous variables and I'm trying to fit a model to a single (dependent) variable y. Two of my predictors are highly correlated with each other (x1, x2).

As you can see in the above example the lars function with 'stepwise' option first chooses the variable that is most correlated with y. The next variable to enter the model is the one that is most correlated with the residuals. Indeed, it is x4:

> round((cor(cbind(resid(lm(y~x1)),x))[1,3:6]),4)
    x2     x3     x4     x5 
0.1163 0.2997 0.9246 0.0037

Now, if I do the 'lasso' option:

> m <- lars(x,y,"lasso",trace=T)
LASSO sequence
Computing X'X ....
LARS Step 1 :    Variable 1     added
LARS Step 2 :    Variable 2     added
LARS Step 3 :    Variable 4     added
LARS Step 4 :    Variable 3     added
LARS Step 5 :    Variable 5     added

It adds both of the correlated variables to the model in the first two steps. This is the opposite of what I have read in several papers. Most of them say that if there is a group of variables among which the correlations are very high, then the 'lasso' tends to select only one variable from the group at random.

Can someone provide an example of this behavior, or explain why my variables x1 and x2 are added to the model one after another (together)?

This is least angle regression which gives an explanation of the lasso steps. — Michael R. Chernick, Jun 14 '12 at 23:17
@MichaelChernick: If you look at the `R` call the OP is making and the associated output he provides, you will see that he is, indeed, using the lasso. As I'm sure you know, a small tweak of the lars algorithm yields the lasso regularization path. — cardinal, Jun 14 '12 at 23:22
My "guess" is that, since x2 includes 4 units of x1, x1 and x2 combined actually have the most variance(9 units). If you lower the coefficient of x2 to 2, you should see that x4 is selected before x1 and x2. — , Jan 15 '13 at 00:13
Can you provide some references for the proof of that "randomness"? Thank you. — ziyuang, May 03 '13 at 19:46
I guess you can find your answer on this paper: http://arxiv.org/pdf/1204.1605.pdf — TPArrow, Jul 18 '14 at 09:39
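The entry order in the question can also be traced by hand. With standardized predictors, the LARS/lasso path starts with the predictor most correlated with y and moves along it until another predictor's correlation with the current residual ties with it. The sketch below applies the single-active-variable case of the LARS step-size formula, working in correlation units. The names `cxy`, `a`, and `gamma` are illustrative, and the data are regenerated with the question's seed and order of random draws.

```r
# Recreate the question's data (same seed, same order of random draws).
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n) * 0.5
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rexp(n)
y  <- 5 * x1 + 4 * x2 + 2 * x3 + 7 * x4 + rnorm(n)
x  <- cbind(x1, x2, x3, x4, x5)

cxy <- drop(cor(x, y))   # correlation of each predictor with y
R   <- cor(x)            # correlations among the predictors

# Step 1: the predictor with the largest absolute correlation enters first.
first  <- names(which.max(abs(cxy)))
s      <- unname(sign(cxy[first]))
C      <- unname(abs(cxy[first]))
others <- setdiff(colnames(x), first)

# Moving a distance g along the first predictor lowers its correlation with
# the residual to C - g, and changes predictor j's correlation to
# cxy[j] - g * a[j]. Predictor j ties when C - g equals +/- that quantity.
a       <- s * R[others, first]
g_plus  <- (C - cxy[others]) / (1 - a)
g_minus <- (C + cxy[others]) / (1 + a)
gamma   <- pmin(ifelse(g_plus  > 0, g_plus,  Inf),
                ifelse(g_minus > 0, g_minus, Inf))
names(gamma) <- others

first         # first variable to enter
sort(gamma)   # the smallest step identifies the second variable to enter
```

The predictor with the smallest step enters next. Because x2 is almost as correlated with y as x1 is, and loses correlation at nearly the same rate while x1 is being fitted, it should tie very early, whereas x4 keeps its correlation with the residual for much longer.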

score 19 · Answer 1 · answered Aug 22 '14 at 18:22

The collinearity problem is way overrated!

Thomas, you articulated a common viewpoint: that if predictors are correlated, even the best variable selection technique just picks one at random out of the bunch. Fortunately, that's way underselling regression's ability to uncover the truth! If you've got the right type of explanatory variables (exogenous), multiple regression promises to find the effect of each variable holding the others constant. Now if variables are perfectly correlated, then this is literally impossible. If the variables are merely correlated, it may be harder, but with the size of the typical data set today, it's not that much harder.
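As a minimal sketch of the "literally impossible" case, here is what `lm()` does when one predictor is an exact multiple of another. The seed, sample size, and coefficients are arbitrary choices for illustration.

```r
set.seed(1)
N  <- 200
x1 <- rnorm(N)
x2 <- 2 * x1                              # exact multiple of x1: perfect correlation
y  <- 0.5 * x1 - 0.7 * x2 + rnorm(N)

fit_perfect <- lm(y ~ x1 + x2)
coef(fit_perfect)   # the x2 coefficient is NA: it cannot be separated from x1
```

With perfect collinearity the design matrix is rank deficient, so `lm()` drops the aliased column instead of estimating it.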

Collinearity is a low-information problem. Have a look at this parody of collinearity by Art Goldberger on Dave Giles's blog. The way we talk about collinearity would sound silly if applied to a mean instead of a partial regression coefficient.

Still not convinced? It's time for some code.

set.seed(34234)

N <- 1000
x1 <- rnorm(N)
x2 <- 2*x1 + .7 * rnorm(N)
cor(x1, x2) # correlation is .94
plot(x2 ~ x1)

I've created highly correlated variables x1 and x2, but you can see in the plot below that when x1 is near -1, we still see variability in x2.

[Figure: scatter plot of x2 against x1]

Now it's time to add the "truth":

y <- .5 * x1 - .7 * x2 + rnorm(N) # Data Generating Process

Can ordinary regression succeed amidst the mighty collinearity problem?

summary(lm(y ~ x1 + x2))

Oh yes it can:

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0005334  0.0312637  -0.017    0.986    
x1           0.6376689  0.0927472   6.875 1.09e-11 ***
x2          -0.7530805  0.0444443 -16.944  < 2e-16 ***
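To see why sample size matters here, the following sketch refits the same data-generating process at several values of N and records the standard error of the x1 coefficient. The helper name `se_x1` and the chosen sample sizes are assumptions made for illustration.

```r
# Standard error of the x1 coefficient for one simulated data set of size N.
se_x1 <- function(N) {
  x1 <- rnorm(N)
  x2 <- 2 * x1 + 0.7 * rnorm(N)           # same correlation structure as above
  y  <- 0.5 * x1 - 0.7 * x2 + rnorm(N)    # same data generating process
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
}

set.seed(1)
sizes <- c(100, 1000, 10000)
ses   <- sapply(sizes, se_x1)
data.frame(N = sizes, se_x1 = ses)
```

The correlation between x1 and x2 is the same at every N, yet the standard error shrinks roughly like 1/sqrt(N), which is the sense in which collinearity behaves like a shortage of information.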

Now I didn't talk about LASSO, which your question focused on. But let me ask you this. If old-school regression w/ backward elimination doesn't get fooled by collinearity, why would you think state-of-the-art LASSO would?
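One way to probe that closing question empirically is to fit the lasso to the same data at a few penalty levels. The sketch below uses a small coordinate-descent solver written in base R and is not the `lars` algorithm. The function names `soft` and `lasso_cd`, the penalty grid, and the number of sweeps are all assumptions made for illustration. Coefficients are reported on the standardized scale.

```r
set.seed(34234)
N  <- 1000
x1 <- rnorm(N)
x2 <- 2 * x1 + .7 * rnorm(N)
y  <- .5 * x1 - .7 * x2 + rnorm(N)

# Standardize predictors so that mean(x^2) = 1, and center y.
X  <- scale(cbind(x1, x2)) * sqrt(N / (N - 1))
yc <- y - mean(y)

# Soft-thresholding operator used by the lasso coordinate update.
soft <- function(z, lam) sign(z) * pmax(abs(z) - lam, 0)

# Minimize (1 / (2n)) * ||y - X b||^2 + lambda * ||b||_1 by cyclic
# coordinate descent (hypothetical helper, fixed number of sweeps).
lasso_cd <- function(X, y, lambda, sweeps = 500) {
  n <- nrow(X)
  b <- numeric(ncol(X))
  r <- y                                   # residual when all coefficients are 0
  for (it in seq_len(sweeps)) {
    for (j in seq_len(ncol(X))) {
      r    <- r + X[, j] * b[j]            # add back predictor j's contribution
      b[j] <- soft(sum(X[, j] * r) / n, lambda)
      r    <- r - X[, j] * b[j]            # remove the updated contribution
    }
  }
  setNames(b, colnames(X))
}

lambdas <- c(1, 0.5, 0.1, 0.01)
path <- t(sapply(lambdas, function(l) lasso_cd(X, yc, l)))
rownames(path) <- lambdas
path   # one row per penalty level, one column per predictor
```

Reading the rows from the largest penalty to the smallest shows which predictor enters first and at what point both are kept. A heavy penalty can still leave only one of the two correlated predictors in the model, so the answer depends on how strongly the fit is penalized.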

To your last point, while OLS is rotationally equivariant, I would not expect the same property from LASSO because of the $L_1$ norm. Whether LASSO is fancy or not is probably not relevant. — steveo'america, Jun 27 '17 at 23:49
The idea was that simpler concepts could be used to explain OP's described phenomenon, and that these concepts are not fundamentally altered by the addition of a data-driven regularization term. — Ben Ogorek, Jun 28 '17 at 04:28

vtshen · Answer 2 · 2016-07-23T13:27:16.583

Ben's answer inspired me to go one step further along the path he provided: what happens when the "truth", y, arises in other situations?

In the original example, y depends on the two highly correlated variables x1 and x2. Now suppose there is another variable, x3:

x3 = c(1:N)/250 # N = 1000 as defined above; x3 is on a scale similar to x1, and the scale of x3 affects the regression results below

The "truth" y is now defined as follows:

y = .5 * x1 - .7 * x3 + rnorm(N) # Data Generating Process

What would happen to the regression?

summary(lm(y ~ x1 + x2))

There is a strong collinearity effect, and the standard error of x2 is large. Even so, the linear regression identifies x2 as a non-significant variable.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.39164    0.04172 -33.354  < 2e-16 ***
x1           0.65329    0.12550   5.205 2.35e-07 ***
x2          -0.07878    0.05848  -1.347    0.178

library(car) # assumed source of vif(); the answer does not name the package
vif(lm(y ~ x1 + x2))

      x1       x2 
9.167429 9.167429
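The VIF can also be computed without an extra package, which makes clear what it measures. For predictor j it is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the other predictors. The sketch below regenerates x1 and x2 with the seed from Ben's answer; the variable names are illustrative.

```r
set.seed(34234)
N  <- 1000
x1 <- rnorm(N)
x2 <- 2 * x1 + .7 * rnorm(N)

# R^2 from regressing x1 on the remaining predictor(s).
r2_x1  <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_x1)

# With only two predictors this reduces to the squared pairwise correlation.
vif_pair <- 1 / (1 - cor(x1, x2)^2)

c(vif_x1 = vif_x1, vif_pair = vif_pair)
```

Note that the response y never appears in this calculation, so the VIF is identical for any y fitted to the same predictors.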

What about another regression case?

summary(lm(y ~ x1 + x2 + x3))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02100    0.06573   0.319    0.749    
x1           0.55398    0.09880   5.607 2.67e-08 ***
x2          -0.02966    0.04604  -0.644    0.520    
x3          -0.70562    0.02845 -24.805  < 2e-16 ***

The variable x2 is not significant, so the regression suggests removing it.

vif(lm(y ~ x1 + x2 + x3))

      x1       x2       x3 
9.067865 9.067884 1.000105

From the above results, collinearity is not a problem in this linear regression, and checking the VIF is not very helpful.

Let's look at another situation, in which x3 is not on the same scale as x1:

x3 = c(1:N) # N = 1000 as defined above

The "truth" y is defined as above:

y = .5 * x1 - .7 * x3 + rnorm(N) # Data Generating Process

What would happen to the regression?

summary(lm(y ~ x1 + x2))

There is a strong collinearity effect. The standard errors of x1 and x2 are too large, and the linear regression fails to identify the important variable x1.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -350.347      6.395 -54.783   <2e-16 ***
x1            25.207     19.237   1.310    0.190    
x2           -12.212      8.963  -1.362    0.173

vif(lm(y ~ x1 + x2))

      x1       x2 
9.167429 9.167429

What about another regression case?

summary(lm(y ~ x1 + x2 + x3))

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept)  0.0360104  0.0610405     0.590    0.555    
x1           0.5742955  0.0917555     6.259 5.75e-10 ***
x2          -0.0277623  0.0427585    -0.649    0.516    
x3          -0.7000676  0.0001057 -6625.170  < 2e-16 ***

The variable x2 is not significant, so the regression suggests removing it.

vif(lm(y ~ x1 + x2 + x3))

      x1       x2       x3 
9.182507 9.184419 1.001853

For comparison, here is the regression of y on x1 and x3 only. Notice that the standard error of x1 is only 0.03.

summary(lm(y ~ x1 + x3))

Coefficients:
              Estimate Std. Error   t value Pr(>|t|)    
(Intercept) -0.1595528  0.0647908    -2.463    0.014 *  
x1           0.4871557  0.0321623    15.147   <2e-16 ***
x3          -0.6997853  0.0001121 -6240.617   <2e-16 ***

Based on the above results, my conclusion is:

when the predictor variables are on similar scales, collinearity is not a problem in linear regression;
when the predictor variables are not on similar scales,
- when the two highly correlated variables are both in the true model, collinearity is not a problem;
- when only one of the two highly correlated variables is in the true model,
  - if the other "true" variables are included in the linear regression, the regression will flag as non-significant the variables that are merely correlated with the significant variable;
  - if the other "true" variables are not included in the linear regression, the collinearity problem is severe, resulting in standard error inflation.
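One possible reading of these cases, offered as an interpretation and not as the answer's own claim, is that the standard error of a coefficient depends on the residual scale as well as on the VIF. For a model with an intercept, se(b_j) = sigma_hat * sqrt(VIF_j / ((N - 1) * var(x_j))). The sketch below checks this identity for the model that omits x3 and compares the residual standard deviation with and without x3. The data are regenerated with the seed from Ben's answer, so the numbers need not match the tables above exactly, and the variable names are illustrative.

```r
set.seed(34234)
N  <- 1000
x1 <- rnorm(N)
x2 <- 2 * x1 + .7 * rnorm(N)
x3 <- 1:N                                  # the large-scale version of x3
y  <- .5 * x1 - .7 * x3 + rnorm(N)

fit_omit <- lm(y ~ x1 + x2)                # true predictor x3 left out
fit_full <- lm(y ~ x1 + x2 + x3)           # true predictor x3 included

sigma_omit <- summary(fit_omit)$sigma      # residual standard deviation
sigma_full <- summary(fit_full)$sigma

# Rebuild the standard error of x1 in the model without x3 from its parts.
vif_x1     <- 1 / (1 - cor(x1, x2)^2)
se_by_hand <- sigma_omit * sqrt(vif_x1 / ((N - 1) * var(x1)))
se_lm      <- summary(fit_omit)$coefficients["x1", "Std. Error"]

c(sigma_omit = sigma_omit, sigma_full = sigma_full,
  se_by_hand = se_by_hand, se_lm = se_lm)
```

The VIF of x1 is about the same whether or not x3 is in the model, so on this reading the jump in the standard errors comes mostly from the much larger residual variance when x3 is omitted.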

Interesting, although these results assume linear relationships between the predictors/features and y. They are far from comprehensive. What happens if there are strong nonlinear relationships in your predictors (e.g. interaction terms x1 * x2, step-function features or dummy variables (1 if x1 > c for some constant), etc.)? If you work with low signal-to-noise data, as in feature creation for algorithmic trading, you always want parsimonious models to reduce overfitting (because your signals are weak), so there are still strong reasons to deal with multicollinearity. — FXQuantTrader, Dec 14 '16 at 04:27
