
I picked up a method for determining the best features for a classifier from somewhere, and I want to ask whether it is reliable or not:

Very simply, I was trying to select the best features for a Logistic Regression classifier.

Using Lasso().coef_, it turned out that every time I switched the training and testing data indices (with a random ShuffleSplit, so in a cross-validation process), the zero Lasso coefficients changed. So, to visualize it:

Loop1    
random_seed=1
Lasso_coefs_obtained = [0,   0.9, 0,  0,   0.1]

Loop2
random_seed=2
Lasso_coefs_obtained = [0.3, 0,   0,  0.7, 0  ]

Loop3
random_seed=3
Lasso_coefs_obtained = [0.1, 0.8, 0,  0.2, 0.2]

In order to determine whether a feature's coefficient was really null, I summed the coefficient lists and identified the zero elements (which must be the elements that are 0 in all of the lists).

Sum_coefs = [0.4, 1.7, 0,  0.9, 0.3]

In my case, I applied this method with 10 loops (so I have 10 coefficient lists over 72 variables). After 10 loops, only 55 coefficients remained non-zero, so I deleted 17 features from the classifier.
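In code, the loop looks roughly like this (a minimal sketch with placeholder random data standing in for my real 72-feature set; I sum absolute values here so that coefficients of opposite sign cannot cancel):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import ShuffleSplit

# placeholder data standing in for my real 72-feature data set
rng = np.random.RandomState(0)
X, y = rng.randn(100, 72), rng.randn(100)

coef_sum = np.zeros(X.shape[1])
# each of the 10 splits uses a different random train/test partition,
# like changing the random seed on every loop
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
for train_idx, _ in splitter.split(X):
    lasso = Lasso(alpha=1.0).fit(X[train_idx], y[train_idx])
    # sum absolute values so positive and negative coefficients cannot cancel
    coef_sum += np.abs(lasso.coef_)

# features whose coefficient was 0 in every single loop
dropped = np.where(coef_sum == 0)[0]
```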

Is this method reliable, or am I just making up an algorithm with no guarantees?

RubenS

1 Answer

Clarifying your approach

the zero Lasso coefficients changed

What value of $\lambda$ (the Lasso penalty parameter) did you use, and how did you determine it? Your approach seems unclear:

  • Have you performed k-fold cross validation on your training set for a range of $\lambda$ values and chosen the $\lambda$ that gives the best results? (A sketch of this workflow follows after this list.)
  • I would suggest starting with cross validation without replacement, so as to avoid a kind of bootstrapping effect (i.e. be careful with ShuffleSplit)
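For reference, a minimal sketch of that workflow in scikit-learn follows; the make_regression data is only a synthetic stand-in for your real training set, and the alpha grid is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# synthetic stand-in for your real training data
X_train, y_train = make_regression(n_samples=200, n_features=72,
                                   n_informative=10, noise=5.0,
                                   random_state=0)

# k-fold CV without replacement: each observation falls in exactly one fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = LassoCV(alphas=np.logspace(-4, 1, 50), cv=cv).fit(X_train, y_train)

print("selected lambda:", model.alpha_)
print("non-zero coefficients:", int((model.coef_ != 0).sum()))
```

Only once $\lambda$ has been fixed this way does the question of which coefficients are exactly zero become well posed.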

A good introduction to combining the Lasso with cross validation is provided by the Lasso's inventor, Robert Tibshirani, on pages 15 and 16 here. See also: Lasso cross validation, and here.



Some reasons for why Lasso coefficients would be different

  • Wrong approach: if you only use the default $\lambda$ value of your algorithm (e.g. from R or sklearn), resample the data set with replacement and perform the Lasso, you are likely to obtain different parameter values.
  • Too few data points: if your data set is too small, or if the $K$ of the K-fold CV is too large, your model will be fitted on a subset which is not representative of the overall data set, which would explain different results across folds.
  • More features than data points: CV and the Lasso are unstable in this case.
  • Unstable model or extreme collinearity: if you use a standard approach to select features but still get different coefficients, my intuition is that your features are so highly correlated that the algorithm struggles, encounters numerical issues, or becomes unstable (a toy illustration follows this list).
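To illustrate the collinearity point, here is a toy example (entirely synthetic, with an arbitrary penalty): two nearly identical columns, where how the weight is shared between them can change from one subsample to the next.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n = 200
x1 = rng.randn(n)
x2 = x1 + 0.01 * rng.randn(n)          # x2 is almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.randn(n)

for seed in range(3):
    idx = np.random.RandomState(seed).choice(n, size=150, replace=False)
    coefs = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    print("subsample", seed, "->", np.round(coefs, 2))
# how the weight is shared between the two correlated columns can change
# from one subsample to the next, even though the signal is identical
```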



Bootstrapping

Provided that none of the above issues apply, bootstrapping your data set and performing standard k-fold CV can still be useful:

  • You will need to repeat the bootstrap experiment hundreds or thousands of times (not 10).
  • You can perform statistical inference on the bootstrap results, such as bootstrapped confidence intervals, statistical significance tests, etc.
  • Be careful when interpreting these results, however, as bootstrap inference has its own limitations, biases and interpretation issues (a non-trivial topic; look on Stack Exchange). A sketch of such a loop follows below.
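As a rough sketch of what such a bootstrap loop could look like (selection_frequencies is a hypothetical helper, and alpha is assumed to have been fixed by cross validation beforehand), the idea is to count how often each coefficient is selected across resamples, in the spirit of stability selection:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import resample

def selection_frequencies(X, y, alpha, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which each coefficient is non-zero."""
    rng = np.random.RandomState(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)   # sample with replacement
        counts += Lasso(alpha=alpha).fit(Xb, yb).coef_ != 0
    return counts / n_boot
```

Features with a selection frequency near 1 are consistently kept, those near 0 are consistently dropped, and anything in between reflects exactly the instability you observed.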
Xavier Bourret Sicotte