4

Is feature selection and training on the same sample a bad idea? I want to emphasize that I am not going to use the test set for feature selection.

If I use the whole training set for feature selection and then for training the model, the training sample is larger. Are there any drawbacks to such an approach?

doubts
  • 141
  • 2
  • Can you edit your post to comment a bit more on what exactly you are doing to choose features? You don't mention cross-validation, so I would conclude that you are using the training error rate to select features. In this case, any time your model complexity increases, the training error will decrease by mathematical necessity (at least for the objective function of the optimization), so there are real and unavoidable issues with that approach. – Matthew Drury Oct 28 '15 at 14:57

1 Answer

1

I think these related questions are worth a look:

Can I perform an exhaustive search with cross-validation for feature selection?

Feature selection and cross-validation

Those questions discuss cross-validation specifically, but the idea is mostly the same: when you reuse the training data to fit a model after running a feature selection algorithm on it, it is essentially like fitting a first, simple model on that data and then using its results (by zeroing out some variables, chosen non-randomly) to fit a second model on the same data.
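To keep the two stages from contaminating each other when you estimate performance, feature selection can be refit inside each cross-validation fold. Here is a minimal sketch using scikit-learn (my choice of library; the answer above doesn't name one), with a synthetic dataset and `SelectKBest` as a stand-in for whatever selection method you actually use:

```python
# Safe pattern: feature selection nested inside each CV fold via a Pipeline,
# so the held-out fold never influences which features are selected.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data: 50 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),  # refit on each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Because the `Pipeline` is passed as a single estimator to `cross_val_score`, the selection step sees only each fold's training portion, which is exactly the separation the linked questions recommend.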

In practice this is often not so harmful, because it is hard to badly overfit through feature selection alone unless your model has a huge number of noisy features.
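The "many noisy features" caveat is easy to demonstrate. In this sketch (again using scikit-learn, as an assumption) the data is pure noise, so any honest estimate of accuracy should hover around chance; selecting features on the full dataset before cross-validating makes the score look deceptively good, while nesting the selection inside the folds does not:

```python
# Demonstrates selection bias on pure-noise data with many features.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5000))        # pure noise: no real signal
y = rng.integers(0, 2, size=100)        # random binary labels

# Leaky: select the 20 "best" features on ALL the data, then cross-validate.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000),
                        X_sel, y, cv=5).mean()

# Honest: selection is refit inside each training fold.
pipe = Pipeline([("sel", SelectKBest(f_classif, k=20)),
                 ("clf", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5).mean()

# The leaky score comes out well above chance; the honest one stays near 0.5.
print(leaky, honest)
```

With only 50 features the gap would be small, which is the answer's point: the danger grows with the number of noisy candidate features.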

morph
  • 11
  • 1