
It is known that LASSO can be used for feature selection.
How can I know if the model is reliable for that purpose?

In general, the model's accuracy, $R^2$, etc. don't concern me, because I don't use the model for prediction.

But for example, if the model's accuracy is 0.5, is it a good idea to use it for dropping variables?

(After the feature selection process, I will choose a classification model and train it on the data, and in that step, the predictor accuracy is important for me)

Amit S
    If prediction is not your goal then what is your goal? Why do you want to "select" features, what will you do with them? – user2974951 Jan 04 '22 at 10:06
  • After the feature selection process, I will choose a classification model and train it on the data, and in that step, the predictor accuracy is important for me – Amit S Jan 04 '22 at 10:10
    [Harrell argues that LASSO is unlikely to select the correct features.](https://stats.stackexchange.com/questions/411035/does-lasso-suffer-from-the-same-problems-stepwise-regression-does) // Why do feature selection with LASSO instead of running the LASSO regression and using that model? – Dave Jan 04 '22 at 12:33
    You want to look into conditions under which LASSO (or some variant of it -- Relaxed, Adaptive, squared-root LASSO, etc) can achieve consistent variable selection. If your data (or the assumed data generating process) satisfies them, LASSO can be reliable. – runr Jan 04 '22 at 12:49
  • If you are eventually looking for classification (categorical response), why do you try to base feature selection on a linear prediction model (continuous metric response)? What is the response variable used for the LASSO model? – cdalitz Jan 04 '22 at 13:30

2 Answers


Many analysts automatically assume that feature selection is a good idea. This does not follow. Parsimony is the enemy of predictive discrimination. Perhaps more important, feature selection, whether using the lasso or other methods, is unreliable. The way to tell whether the lasso is good enough is to test its resilience/stability using the bootstrap. The bootstrap will also show you how difficult it is to choose the penalty parameter $\lambda$ for the lasso, as you'll probably see very different values of $\lambda$ selected over multiple resamples. For each bootstrap resample, find the list of lasso-selected predictors (those with nonzero coefficients) and see how the lists vary. Also compute confidence intervals for the ranks of variable importance; you'll see these intervals are wide, exposing the difficulty of the task. These are discussed in one of the final chapters of BBR.
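As a rough illustration of that bootstrap check (a minimal sketch, not Harrell's code, assuming Python with scikit-learn, whose `LassoCV` tunes the penalty by cross-validation and calls it `alpha` rather than $\lambda$):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: 20 candidate predictors, only 5 truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

rng = np.random.default_rng(0)
n_boot = 100
selected = np.zeros((n_boot, X.shape[1]), dtype=bool)
penalties = np.empty(n_boot)

for b in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample of rows
    fit = LassoCV(cv=5).fit(X[idx], y[idx])
    selected[b] = fit.coef_ != 0                # which features were kept
    penalties[b] = fit.alpha_                   # which penalty was chosen

# Unstable selection frequencies and widely varying penalties are the warning signs.
print("per-feature selection frequency:", selected.mean(axis=0).round(2))
print("penalty range over resamples:", penalties.min(), "to", penalties.max())
```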

Here is an example showing how poorly the lasso works even in the best of situations: the predictors are uncorrelated and the distribution of the true, unknown regression coefficients follows a Laplace distribution, which is exactly what the lasso penalty is optimized for. Even so, variables with very small true coefficients are frequently selected, and variables with large coefficients are frequently not selected. This is from here.

[Figure: probability of the lasso selecting a variable, as a function of its true importance]
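A rough way to reproduce this kind of experiment (my own sketch, not the answer's code; assumptions: uncorrelated standard-normal predictors, true coefficients drawn from a Laplace distribution, Gaussian noise):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p, n_sim = 200, 30, 50
beta = rng.laplace(scale=1.0, size=p)            # true coefficients, Laplace-distributed

hits = np.zeros(p)                               # how often each feature is selected
for _ in range(n_sim):
    X = rng.standard_normal((n, p))              # uncorrelated predictors
    y = X @ beta + 5.0 * rng.standard_normal(n)  # modest signal:noise ratio
    hits += LassoCV(cv=5).fit(X, y).coef_ != 0

# Selection frequency versus true coefficient size, smallest to largest.
for j in np.argsort(np.abs(beta)):
    print(f"|beta| = {abs(beta[j]):5.2f}  selected in {hits[j] / n_sim:.0%} of fits")
```

With a low enough signal-to-noise ratio you should see small-coefficient variables selected and large-coefficient variables missed, as in the figure.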

Frank Harrell
  • How heavily does the simulation results depend on the variance of the error term? – runr Jan 05 '22 at 19:49
    I'm not able to find the code in the provided link, but in straightforward simulations with reasonable error variance I'm getting much better selection proportions. It's clear that any signal can be killed with large enough errors for fixed $n,p$, but it's an interesting question what signal-to-noise ratio we can tolerate for adequate results, and how that changes as $p,n$ increase. – runr Jan 05 '22 at 20:15
  • The error variance in my example is better than what I see in medical applications, where the signal:noise ratio is usually not very good. If you are doing simple things like image recognition then every method works better (your $R^2$ is very high then). – Frank Harrell Jan 05 '22 at 21:18
  • Thanks for the response, very interesting. – runr Jan 05 '22 at 22:45
  • Simulate data where the true $R^2$ is about 0.25. – Frank Harrell Jan 05 '22 at 22:47

Lasso is a common regression technique for variable selection and regularization. By cross-validating over many folds and trying different values of $\alpha$, you can find the set of beta coefficients that predicts your outcome well without overfitting or underfitting. If the lasso has shrunk the beta coefficient of a covariate to exactly 0, you can either choose to drop that feature, since it does not contribute to the predictor, or proceed in the knowledge that it is essentially meaningless.

Concretely, consider a sample of $N$ observations, each with $p$ covariates and a single outcome (the typical setting for most regression problems). The objective of the lasso is to solve:

$ \min_{ \beta_0, \beta } \left\{ \sum_{i=1}^N (y_i - \beta_0 - x_i^T \beta)^2 \right\} \text{ subject to } \sum_{j=1}^p |\beta_j| \leq \alpha. $

Here $ \beta_0 $ is the constant coefficient, $ \beta:=(\beta_1,\beta_2,\ldots, \beta_p)$ is the coefficient vector, and $\alpha$ is a prespecified free parameter that determines the degree of regularization.
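A minimal sketch of this procedure (assuming Python with scikit-learn; note that scikit-learn calls its penalty parameter `alpha`, playing the role of the regularization constraint above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)   # lasso is sensitive to feature scale

fit = LassoCV(cv=10).fit(X, y)          # penalty chosen by 10-fold cross-validation
print("chosen penalty:", fit.alpha_)
print("features kept:   ", np.flatnonzero(fit.coef_ != 0))
print("features dropped:", np.flatnonzero(fit.coef_ == 0))
```

Whether dropping the zeroed features is actually reliable is exactly what the bootstrap check in the other answer probes.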

Jay Ekosanmi