I have a question about the selection consistency of the lasso in a setting with two predictors, where the first predictor has a strong effect on the outcome and the second predictor has none. I am interested in how often, depending on the sample size $n$, the lasso sets the coefficient of the second predictor to zero. In R, this is easy to simulate with the following code:
library( glmnet )
p <- 2
n <- c( 50, 100, 250, 500, 1000, 2000, 3000, 5000, 10000, 20000 )
b <- c( 0.8, 0 )
nrep <- 500
results <- NULL
for ( nn in 1:length( n ) ) {
  tmpN <- n[nn]
  for ( ii in 1:nrep ) {
    #- generate data:
    X <- matrix( rnorm( tmpN*p, 0, 1 ), ncol = p )
    y <- X %*% b + rnorm( tmpN, 0, 0.36 )
    #- fit the lasso path (type.measure is a cv.glmnet argument, so it is dropped here):
    fit <- glmnet( x = X, y = y, family = "gaussian", alpha = 1 )
    #- cross-validation:
    fit.cv <- cv.glmnet( x = X, y = y, family = "gaussian", grouped = FALSE,
                         type.measure = "mse", nfolds = 10, alpha = 1 )
    #- extract lambda.min and the corresponding coefficients:
    lambda_min <- fit.cv$lambda.min
    est.lambda_min <- as.matrix( coef( fit, s = lambda_min ) )
    #- collect results:
    results <- rbind( results, data.frame( n = tmpN, id = ii,
                                           b1 = est.lambda_min[2], b2 = est.lambda_min[3],
                                           lambda_min = lambda_min ) )
  }
}
#- indicators: b1 selected, b2 correctly set to zero:
results$indb1 <- ifelse( results$b1 != 0, 1, 0 )
results$indb2 <- ifelse( results$b2 == 0, 1, 0 )
aggregate( results[,-c(1:2)], by = list( n = results$n ), mean )
where I used 10-fold cross-validation in each replication to select $\lambda$, and then counted how often the truly zero coefficient $b_2$ was estimated as exactly zero. This yields the following result:
       n  lambda_min indb1 indb2
 1    50 0.021464939     1 0.500
 2   100 0.017065004     1 0.542
 3   250 0.011381177     1 0.570
 4   500 0.008466896     1 0.570
 5  1000 0.006601253     1 0.576
 6  2000 0.005402042     1 0.580
 7  3000 0.005052918     1 0.618
 8  5000 0.004995445     1 0.716
 9 10000 0.005084861     1 0.858
10 20000 0.005218356     1 0.964
You can see that with small samples the coefficient is set to 0 in only about half of the cases. The proportion then increases non-linearly and reaches an acceptable level (say, 80%) only for very large $n$.
I have a hard time understanding this behavior. I know a theorem (e.g., Fan et al., 2020, Statistical Foundations of Data Science) according to which the probability that the zero coefficients are set to zero tends to 1 when $\sqrt{n}\lambda \to \infty$. I also know a result saying that when $\lambda$ is of order $O(\sigma\sqrt{\log(p)/n})$, there is at least a chance of completely eliminating all predictors with zero weights, but I cannot use these results to explain the pattern above. Overall, the $\lambda$ chosen by cross-validation seems too small in the small samples for $b_2$ to be set to zero. Is it possible to estimate the probability that a zero coefficient is set to zero for fixed $p$ and $n$? Would one get better results with another method (adaptive lasso, SCAD)?
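To make the "$\lambda$ seems too small" point concrete, here is a quick check I would append to the simulation. It is only a sketch: it assumes the results data frame from the code above and plugs in the true $\sigma = 0.36$ and $p = 2$ from the data-generating step, and the reference value $\sigma\sqrt{2\log(p)/n}$ is just my reading of the order condition (the constant $\sqrt{2}$ is illustrative, only the order matters):
#- quick check: how does the cross-validated lambda scale with n?
#- (assumes the 'results' data frame from the simulation above)
sigma <- 0.36   # true error sd from the data-generating step
p     <- 2
chk <- aggregate( lambda_min ~ n, data = results, mean )
chk$sqrt_n_lambda <- sqrt( chk$n ) * chk$lambda_min        # should diverge for b2 to vanish
chk$lambda_order  <- sigma * sqrt( 2 * log( p ) / chk$n )  # reference order sigma*sqrt(2*log(p)/n)
chk
With the numbers from the table above, $\sqrt{n}\,\lambda_{\min}$ grows only slowly up to about $n = 3000$ and then roughly like $\sqrt{n}$ once $\lambda_{\min}$ plateaus near 0.005, which is also where indb2 starts to climb; that is the pattern I would like to understand theoretically.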
Thanks a lot! Stefan
PS: Literature references would also help me.