I would appreciate your opinion on an analysis approach I have in mind. The idea is to perform variable selection with repeated runs of Lasso regression (using glmnet in R). Basically, the workflow would be:
- Run Lasso in the usual classification setting with a random 70/30 train/test split, and repeat this 1000 times
- For each variable, count the number of times it had a non-zero coefficient, i.e. it was chosen
- For further analysis, take the variables that appeared in (for example) more than 80% of the models, i.e. that have a count > 800.
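
To make the steps above concrete, here is a minimal sketch of what I have in mind. The simulated data, the variable names (`v1` ... `v20`), the choice of `lambda.min` from `cv.glmnet`, and the reduced number of runs (100 instead of 1000, to keep it fast) are all assumptions for illustration, not fixed parts of the approach:

```r
library(glmnet)

set.seed(1)

# Simulated data (an assumption for illustration): 200 samples, 20 predictors,
# only the first 3 carry signal for the binary outcome.
n <- 200
p <- 20
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("v", seq_len(p))
y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3]))

n_runs <- 100      # the workflow above uses 1000; reduced here for speed
threshold <- 0.8   # keep variables selected in more than 80% of the runs

# Selection counter, one entry per variable.
counts <- setNames(numeric(p), colnames(x))

for (i in seq_len(n_runs)) {
  # Random 70% training subset; the 30% test part is not needed for counting.
  train <- sample(n, size = floor(0.7 * n))
  # alpha = 1 gives the Lasso penalty; lambda is chosen by cross-validation.
  fit <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 1)
  # Coefficients at the cross-validated lambda; drop the intercept (row 1).
  beta <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]
  counts <- counts + (beta != 0)
}

# Selection frequency per variable and the resulting "stable" set.
freq <- counts / n_runs
selected <- names(freq)[freq > threshold]
print(sort(freq, decreasing = TRUE))
print(selected)
```

Note that the result depends on how lambda is picked in each run: using `lambda.1se` instead of `lambda.min` would give sparser models and therefore lower selection counts.
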
Is this approach somehow invalid (maybe because the resulting set of variables might never coincide with the complete set of variables chosen by any single one of the 1000 Lasso regressions)?
If someone knows of a paper where I can read more about this, I'd be grateful for the link.
Thank you!