
I picked up a method for determining the best features for a classifier from somewhere, and I want to ask whether it is reliable or not:

Very simply, I was trying to select the best features for a Logistic Regression classifier.

Using Lasso().coef_, it turned out that every time I switched the training and testing data indices (with a random ShuffleSplit, so in a cross-validation process), the zero Lasso coefficients changed. So, to visualize it:

Loop1    
random_seed=1
Lasso_coefs_obtained = [0,   0.9, 0,  0,   0.1]

Loop2
random_seed=2
Lasso_coefs_obtained = [0.3, 0,   0,  0.7, 0  ]

Loop3
random_seed=3
Lasso_coefs_obtained = [0.1, 0.8, 0,  0.2, 0.2]

In order to determine whether a feature's coefficient was really null, I summed the coefficient lists and identified the zero elements (which must be the elements that are 0 in all of the lists).

Sum_coefs = [0.4, 1.7, 0,  0.9, 0.3]

In my case, I applied this method with 10 loops (so I have 10 coefficient lists over 72 variables). After 10 loops, only 55 coefficients remained non-zero, so I deleted 17 features from the classifier.
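In code, the loop looks roughly like this (a minimal sketch with placeholder random data standing in for my real 72-feature set; I sum absolute values here so that coefficients of opposite sign cannot cancel):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import ShuffleSplit

# placeholder data standing in for my real 72-feature data set
rng = np.random.RandomState(0)
X, y = rng.randn(100, 72), rng.randn(100)

coef_sum = np.zeros(X.shape[1])
# each of the 10 splits uses a different random train/test partition,
# like changing the random seed on every loop
splitter = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
for train_idx, _ in splitter.split(X):
    lasso = Lasso(alpha=1.0).fit(X[train_idx], y[train_idx])
    # sum absolute values so positive and negative coefficients cannot cancel
    coef_sum += np.abs(lasso.coef_)

# features whose coefficient was 0 in every single loop
dropped = np.where(coef_sum == 0)[0]
```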

Is this method reliable, or am I just making up an algorithm with no guarantees?

RubenS

1 Answer

Clarifying your approach

the zero Lasso coefficients changed

What value of $\lambda$ (the Lasso penalty parameter) did you use, and how did you determine it? Your approach seems unclear:

  • Have you performed k-fold cross validation on your training set for a range of $\lambda$ values and chosen the $\lambda$ that gives the best results? (A sketch of this workflow follows after this list.)
  • I would suggest starting with cross validation without replacement, so as to avoid a kind of bootstrapping effect (i.e. be careful with ShuffleSplit)
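For reference, a minimal sketch of that workflow in scikit-learn follows; the make_regression data is only a synthetic stand-in for your real training set, and the alpha grid is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# synthetic stand-in for your real training data
X_train, y_train = make_regression(n_samples=200, n_features=72,
                                   n_informative=10, noise=5.0,
                                   random_state=0)

# k-fold CV without replacement: each observation falls in exactly one fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = LassoCV(alphas=np.logspace(-4, 1, 50), cv=cv).fit(X_train, y_train)

print("selected lambda:", model.alpha_)
print("non-zero coefficients:", int((model.coef_ != 0).sum()))
```

Only once $\lambda$ has been fixed this way does the question of which coefficients are exactly zero become well posed.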

A good introduction to combining the Lasso with cross validation is provided by the Lasso's inventor, Robert Tibshirani, on pages 15 and 16 here. See also: Lasso cross validation, and here.



Some reasons for why Lasso coefficients would be different

  • Wrong approach: if you only use the default $\lambda$ value of your algorithm (e.g. from R or sklearn), resample the data set with replacement and perform the Lasso, you are likely to obtain different parameter values.
  • Too few data points: if your data set is too small, or if the $K$ of the K-fold CV is too large, your model will be fitted on a subset which is not representative of the overall data set, which would explain different results across folds.
  • More features than data points: CV and the Lasso are unstable in this case.
  • Unstable model or extreme collinearity: if you use a standard approach to select features but still get different coefficients, my intuition is that your features are so highly correlated that the algorithm struggles, encounters numerical issues, or becomes unstable (a toy illustration follows this list).
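To illustrate the collinearity point, here is a toy example (entirely synthetic, with an arbitrary penalty): two nearly identical columns, where how the weight is shared between them can change from one subsample to the next.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n = 200
x1 = rng.randn(n)
x2 = x1 + 0.01 * rng.randn(n)          # x2 is almost an exact copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.randn(n)

for seed in range(3):
    idx = np.random.RandomState(seed).choice(n, size=150, replace=False)
    coefs = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    print("subsample", seed, "->", np.round(coefs, 2))
# how the weight is shared between the two correlated columns can change
# from one subsample to the next, even though the signal is identical
```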



Bootstrapping

Provided that none of the above issues apply, bootstrapping your data set and performing standard k-fold CV can still be useful:

  • You will need to repeat the bootstrap experiment hundreds or thousands of times (not 10).
  • You can perform statistical inference on the bootstrap results, such as bootstrapped confidence intervals, statistical significance tests, etc.
  • Be careful when interpreting these results, however, as bootstrap inference has its own limitations, biases and interpretation issues (a non-trivial topic; look on Stack Exchange). A sketch of such a loop follows below.
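As a rough sketch of what such a bootstrap loop could look like (selection_frequencies is a hypothetical helper, and alpha is assumed to have been fixed by cross validation beforehand), the idea is to count how often each coefficient is selected across resamples, in the spirit of stability selection:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import resample

def selection_frequencies(X, y, alpha, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples in which each coefficient is non-zero."""
    rng = np.random.RandomState(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_boot):
        Xb, yb = resample(X, y, random_state=rng)   # sample with replacement
        counts += Lasso(alpha=alpha).fit(Xb, yb).coef_ != 0
    return counts / n_boot
```

Features with a selection frequency near 1 are consistently kept, those near 0 are consistently dropped, and anything in between reflects exactly the instability you observed.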
Xavier Bourret Sicotte