
I have been thinking about this problem for days and I can't seem to arrive at a conclusion about feature selection in linear regression.

Please tell me what is wrong with this simple approach compared with more sophisticated ones like the Lasso, stability selection, or Recursive Feature Elimination (RFE):

Include all features in a statsmodels OLS fit --> remove all features whose p-values are greater than 0.05 (an arbitrary alpha level) --> the remaining ones are my features and I'm done.
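
In code, roughly this (a minimal sketch with made-up data; the column names and the 0.05 cutoff are just placeholders for what I'd actually use):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data; in practice X holds my candidate features and y the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 2.0 * X["x0"] + 0.5 * X["x1"] + rng.normal(size=200)

# Fit OLS with all features, then keep only those with p-value <= 0.05.
model = sm.OLS(y, sm.add_constant(X)).fit()
pvalues = model.pvalues.drop("const")
selected = pvalues[pvalues <= 0.05].index.tolist()
print(selected)
```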

Why use fancier algorithms like the Lasso, RandomizedLasso for stability selection, and RFE at all? What am I missing?

    This has been discussed extensively multiple times before. Please see older posts on [tag:feature-selection] etc. – Richard Hardy Sep 12 '16 at 08:49
    See also [To select variables or not in logistic regression](http://stats.stackexchange.com/questions/202121/to-select-variables-or-not-in-logistic-regression), and [Should covariates that are not statistically significant be 'kept in' when creating a model?](http://stats.stackexchange.com/q/66448/22228) – Silverfish Sep 12 '16 at 09:13

1 Answer


Feature significance does not equate to an improved cross-validation score. I've been researching this quite extensively lately: I've had significant features reduce my cross-validation score, and non-significant features increase it, under K-fold cross-validation. It all depends on what you want to achieve; if it's a good cross-validated score, then you'll need to go beyond p-value screening and use other methods.
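
As a rough sketch of what I mean (illustrative data and feature names, not a prescription): compare the K-fold CV score with and without a candidate feature, and let that, rather than the in-sample p-value, decide whether the feature stays.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Illustrative data: x0 and x1 carry signal, x2 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x0", "x1", "x2"])
y = 2.0 * X["x0"] + 0.5 * X["x1"] + rng.normal(size=200)

def cv_r2(features):
    # Mean R^2 over 5 folds for a plain linear regression on the given columns.
    return cross_val_score(LinearRegression(), X[features], y, cv=5, scoring="r2").mean()

print(cv_r2(["x0", "x1", "x2"]))  # all features
print(cv_r2(["x0", "x1"]))        # drop x2 and see whether the CV score changes
```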