
I have been thinking about this problem for days and I can't seem to arrive at a conclusion about feature selection in linear regression.

Please tell me what is wrong with this simple approach compared with more sophisticated ones like the Lasso, stability selection, or Recursive Feature Elimination (RFE):

Include all features in a statsmodels OLS fit --> remove all features whose p-values are greater than 0.05 (an arbitrary alpha level) --> the remaining ones are my features and I'm done.
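
In code, roughly this (a minimal sketch with made-up data; the column names and the 0.05 cutoff are just placeholders for what I'd actually use):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data; in practice X holds my candidate features and y the target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])
y = 2.0 * X["x0"] + 0.5 * X["x1"] + rng.normal(size=200)

# Fit OLS with all features, then keep only those with p-value <= 0.05.
model = sm.OLS(y, sm.add_constant(X)).fit()
pvalues = model.pvalues.drop("const")
selected = pvalues[pvalues <= 0.05].index.tolist()
print(selected)
```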

Why use fancier algorithms like the Lasso, RandomizedLasso for stability selection, and RFE at all? What am I missing?

    This has been discussed extensively multiple times before. Please see older posts on [tag:feature-selection] etc. – Richard Hardy Sep 12 '16 at 08:49
    See also [To select variables or not in logistic regression](http://stats.stackexchange.com/questions/202121/to-select-variables-or-not-in-logistic-regression), and [Should covariates that are not statistically significant be 'kept in' when creating a model?](http://stats.stackexchange.com/q/66448/22228) – Silverfish Sep 12 '16 at 09:13

1 Answer


Feature significance does not equate to an improved cross-validation score. I've been researching this quite extensively lately: I've had significant features reduce my cross-validation score, and non-significant features increase it, under K-fold cross-validation. It all depends on what you want to achieve; if it's a good cross-validated score, then you'll need to go beyond p-value screening and use other methods.
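
As a rough sketch of what I mean (illustrative data and feature names, not a prescription): compare the K-fold CV score with and without a candidate feature, and let that, rather than the in-sample p-value, decide whether the feature stays.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Illustrative data: x0 and x1 carry signal, x2 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x0", "x1", "x2"])
y = 2.0 * X["x0"] + 0.5 * X["x1"] + rng.normal(size=200)

def cv_r2(features):
    # Mean R^2 over 5 folds for a plain linear regression on the given columns.
    return cross_val_score(LinearRegression(), X[features], y, cv=5, scoring="r2").mean()

print(cv_r2(["x0", "x1", "x2"]))  # all features
print(cv_r2(["x0", "x1"]))        # drop x2 and see whether the CV score changes
```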