
I have a question about the selection consistency of Lasso regression in a setting with two predictors, where the first predictor has a very strong influence and the second predictor has no influence on the outcome. I am interested in how often, depending on the sample size $n$, the Lasso sets the weight of the second predictor to zero. In R, this can be simulated easily with the following code:

library( glmnet )

p <- 2
n <- c( 50, 100, 250, 500, 1000, 2000, 3000, 5000, 10000, 20000 )
b <- c( 0.8, 0 )
nrep <- 500

results <- NULL

for ( nn in 1:length( n ) ) {

  tmpN <- n[nn]

  for ( ii in 1:nrep ) {

    #- generate data:
    X <- matrix( rnorm( tmpN*p, 0, 1 ), ncol = p )
    y <- X%*%b + rnorm( tmpN, 0, 0.36 )

    #- fit the full lasso path (type.measure belongs to cv.glmnet, not glmnet):
    fit <- glmnet( x = X, y = y, family = "gaussian", alpha = 1 )
  
    #- 10-fold cross-validation to choose lambda:
    fit.cv <- cv.glmnet( x = X, y = y, family="gaussian", grouped = FALSE,
      type.measure = "mse", nfolds = 10, alpha = 1 )
  
    #- extract lambda.min and the coefficients at that value:
    lambda_min <- fit.cv$lambda.min
    est.lambda_min <- as.matrix( coef( fit, s = lambda_min ) )

    #- make output:
    results <- rbind( results, data.frame( n = tmpN, id = ii,
      b1 = est.lambda_min[2], b2 = est.lambda_min[3], 
      lambda_min = lambda_min ) )

  }

}

#- indicators: b1 selected (nonzero), b2 correctly set to exactly zero:
results$indb1 <- ifelse( results$b1 != 0, 1, 0 )
results$indb2 <- ifelse( results$b2 == 0, 1, 0 )
aggregate( results[,-c(1:2)], by = list( n = results$n ), mean )

Here I used 10-fold cross-validation in each replication to determine $\lambda$ and then counted how often the zero coefficient $b_2$ was estimated as exactly zero. This yields the following result:


       n  lambda_min indb1 indb2
1     50 0.021464939     1 0.500
2    100 0.017065004     1 0.542
3    250 0.011381177     1 0.570
4    500 0.008466896     1 0.570
5   1000 0.006601253     1 0.576
6   2000 0.005402042     1 0.580
7   3000 0.005052918     1 0.618
8   5000 0.004995445     1 0.716
9  10000 0.005084861     1 0.858
10 20000 0.005218356     1 0.964


You can see that in small samples the zero coefficient is set to 0 in only about half of the cases. The proportion then increases non-linearly with $n$ and reaches acceptable levels (say, above 80%) only for very large $n$.

I somehow have a hard time understanding this behavior. I know a theorem (e.g., Fan et al., 2020, Statistical Foundations of Data Science) according to which the probability that the zero coefficients are set to zero tends to 1 when $\sqrt{n}\,\lambda \to \infty$. I also know a theorem that says that when $\lambda$ is of order $O\!\left(\sigma\sqrt{\log(p)/n}\right)$, there is at least a chance of completely eliminating all predictors with zero weights, but I cannot use these results to explain the pattern above. Overall, the $\lambda$ chosen by cross-validation seems too small in the small samples for $b_2$ to be set to zero. Is it possible to estimate the probability that a zero coefficient is set to zero for fixed $p$ and $n$? Would one get better results with another method (adaptive lasso, SCAD)?
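
For concreteness, here is a rough back-of-the-envelope check I tried, comparing the theoretical rate $\sigma\sqrt{\log(p)/n}$ from the second theorem with the cross-validated lambda.min reported above (this is purely my own check, using the simulation's $\sigma = 0.36$ and $p = 2$):

#- rough check (my own addition): theoretical rate sigma*sqrt(log(p)/n)
#- versus the cross-validated lambda.min in the table above
sigma <- 0.36
p     <- 2
n     <- c( 50, 100, 250, 500, 1000, 2000, 3000, 5000, 10000, 20000 )
data.frame( n = n, rate = sigma*sqrt( log( p )/n ) )

If I computed this correctly, the rate is about 0.042 at $n = 50$ and about 0.002 at $n = 20000$, so the cross-validated lambda.min lies below it in the small samples and above it in the large samples, but I am not sure this is the right way to read the theorem.

To make the last question concrete, this is roughly the adaptive lasso variant I have in mind for a single replication, with adaptive weights from an initial OLS fit plugged into glmnet via its penalty.factor argument (the weighting scheme with $\gamma = 1$ is just one common choice, not taken from a specific reference):

#- sketch of an adaptive lasso step for one replication (my own construction):
#- adaptive weights from an initial OLS fit, gamma = 1
ols    <- lm( y ~ X )
w      <- 1/abs( coef( ols )[-1] )
fit.ad <- cv.glmnet( x = X, y = y, family = "gaussian", alpha = 1,
  penalty.factor = w, grouped = FALSE, type.measure = "mse", nfolds = 10 )
coef( fit.ad, s = "lambda.min" )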

Thanks a lot! Stefan

PS: Literature references would also help me.

  • See https://stats.stackexchange.com/questions/559159 – Frank Harrell Jan 04 '22 at 13:29
  • For the last question, Hastie et al note in the legend to Figure 2.5 of [Statistical Learning with Sparsity](https://web.stanford.edu/~hastie/StatLearnSparsity/) that "The [nonnegative] garrote shrinks smaller values of $\beta$ more severely than lasso, and the opposite for larger values." In the associated text: "There is also a close relationship between the nonnegative garrote and the _adaptive lasso_." Thus one might do better than lasso with those other methods. The wisdom of using lasso (or any strict predictor selection) with small numbers of predictors, however, is open to question. – EdM Jan 11 '22 at 19:01
