Combining results of multiple Lasso runs / Variable selection

Question

I would appreciate your opinion on an analysis approach I have in mind. The idea is to do variable selection with multiple runs of Lasso regression (using glmnet in R). The workflow would be:

1. Run Lasso in the usual classification setting with a 70/30 train/test split, and repeat this 1000 times.
2. For each variable, count the number of times it had a non-zero coefficient, i.e. it was chosen.
3. For further analysis, take the variables that appeared in more than (for example) 80% of the models, i.e. that have a count > 800.
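
The three steps above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than R/glmnet, and it rests on several assumptions that are not part of the question: a linear-regression Lasso fitted by coordinate descent stands in for glmnet's classification fit, the penalty `lam` is fixed instead of being chosen by cross-validation, the data are simulated, and only 50 runs are used to keep it fast. The function names `lasso_cd` and `selection_frequencies` are hypothetical.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by coordinate descent.

    Assumption: no intercept, so the data should be roughly centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X*beta; beta starts at zero
    col_ms = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_ms[j] == 0.0:
                continue
            # covariance of column j with the partial residual (excluding j)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_ms[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ms[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new
    return beta


def selection_frequencies(X, y, lam, n_runs=1000, train_frac=0.7, seed=0):
    """Steps 1 and 2: refit on random training subsets, count non-zero coefficients."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    n_train = int(round(train_frac * n))
    counts = [0] * p
    for _ in range(n_runs):
        idx = rng.sample(range(n), n_train)  # random 70% training subset
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_runs for c in counts]


# Toy data (assumption): only the first of 5 predictors drives the outcome.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(60)]
y = [2.0 * row[0] + data_rng.gauss(0, 1) for row in X]

freq = selection_frequencies(X, y, lam=0.3, n_runs=50)
selected = [j for j, f in enumerate(freq) if f > 0.8]  # Step 3: 80% cutoff
print(freq)
print(selected)
```

Only the training part of each split is used here; the 30% held-out part would be used for whatever performance check accompanies each run.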

Is this approach somehow invalid (maybe because the resulting set of variables would never be a complete set of variables chosen by one of the 1000 Lasso regressions)?

If someone knows a paper in which I can read more about it, I'd be grateful for the link...

Thank you!

score 1 · Answer 1 · answered Sep 29 '19 at 17:05

Two cautions come to mind.

First, as ISLR notes on page 224:

In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero.

If that's not the case in your situation, you can run into the following type of problem, particularly if you have many more potential predictors than cases. Say that 4 highly correlated predictors are all strongly related to the outcome. Each might then be chosen in only about 1/4 of the runs, so there's a risk that a high cutoff in Step 3 would miss important predictors. Note that a single LASSO fit in this case would choose 1 of the 4 somewhat arbitrarily, but that would still aid prediction on future cases. My understanding of "stability selection" (as noted in another answer) is that it controls false discovery of predictors fairly well, but it would seem to pose a risk of missing true positives in some settings.
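
The arithmetic behind this concern can be illustrated with a stylized simulation. The sketch below does not fit any Lasso model. It simply assumes, as in the scenario above, that each run picks exactly one of 4 interchangeable predictors uniformly at random, and then applies the 80% cutoff from Step 3. The function name and the uniform-choice assumption are illustrative only.

```python
import random


def simulate_split_selection(n_runs=1000, group_size=4, seed=0):
    """Selection frequency of each predictor in a correlated group.

    Assumption: every run selects exactly one group member, chosen
    uniformly at random (a caricature of Lasso's arbitrary pick).
    """
    rng = random.Random(seed)
    counts = [0] * group_size
    for _ in range(n_runs):
        counts[rng.randrange(group_size)] += 1
    return [c / n_runs for c in counts]


group_freq = simulate_split_selection()
cutoff = 0.8
kept = [j for j, f in enumerate(group_freq) if f > cutoff]

print(group_freq)  # each frequency should be near 1/4
print(kept)        # expected to be empty: values near 0.25 cannot clear 0.8
```

Under these assumptions the group as a whole is selected in every run, yet no individual member comes close to the cutoff, which is how a truly important set of predictors could be dropped.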

Second, much depends on what you intend to do for "further analysis" after Step 3. For example, if you were simply to take the selected variables and build a regression model with them, ignoring the fact that you used the data to select those variables, then inference in terms of p-values and confidence intervals for regression coefficients would be incorrect. See Chapter 6 of Statistical Learning with Sparsity for an introduction to ways to approach inference in the context of data-based variable selection.

Thank you, that really helps. The idea is to describe clusters that I got, so I would like to get a subset of variables that are really important for a given cluster, i.e. label. The best way would be to obtain p-values somehow, but for Lasso that is again a separate research question. I checked this paper: http://statweb.stanford.edu/~tibs/ftp/covtest.pdf . Do you know if I can actually use this approach with glmnet? I can get the coefficients for each lambda in the lambda sequence that it returns, so maybe calculating the statistic as per the paper would be the 'cleanest' thing to do? — pelah, Oct 01 '19 at 11:22

score 0 · Accepted Answer · answered Sep 29 '19 at 15:35

What you're describing is called "stability selection" and was published in this paper (Meinshausen and Bühlmann, 2010): https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2010.00740.x

Edgar
