
I know that you should separate your data into training and validation sets before doing feature selection, to avoid getting overly optimistic, misleading results in cross-validation.

But I have also seen people say that you should avoid doing feature selection on the same data set that you train your model on, so as to avoid overfitting to that data.

What some suggest is that you split your data into three sets: training, validation, and testing. You train your model on the training set, do feature selection on the validation set, and evaluate your model on the test set.
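For concreteness, here is a minimal sketch of that three-way scheme, assuming scikit-learn. The synthetic dataset, the logistic regression model, the split proportions, and k=10 are illustrative assumptions, not part of the suggestion itself:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# First split off a held-out test set, then split the rest into
# training and validation sets (roughly 60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Fit the (supervised) feature selector on the validation set only.
selector = SelectKBest(f_classif, k=10).fit(X_val, y_val)

# Train the model on the training set, restricted to the selected features.
model = LogisticRegression(max_iter=1000)
model.fit(selector.transform(X_train), y_train)

# Evaluate once on the untouched test set.
print("test accuracy:", model.score(selector.transform(X_test), y_test))
```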

Is this overfitting really something to worry about, and if so, is the above method a good way to deal with it?

  • Feature selection is generally performed *before* training a model, not after, in order to eliminate noisy or uninformative features that can confound the algorithm. Some classification methods are implicitly feature-selective, and for those it's not possible to separate the feature selection and model-building steps. – Nuclear Hoagie Jul 16 '18 at 12:49
  • Maybe what you mean is model selection? Then everything makes sense. https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set – Xiaoxiong Lin Jul 16 '18 at 13:49
  • @XiaoxiongLin No, I meant feature selection. I really don't know how I could explain the question any further. – Ian Dzindo Jul 16 '18 at 17:00
  • Can you specify an example of feature selection? I'm not an expert, but if you mean something like determining lambda for LASSO as a form of feature selection, that could lead to overfitting. – Xiaoxiong Lin Jul 17 '18 at 13:38
  • @XiaoxiongLin Feature selection as in SelectKBest or SelectPercentile. – Ian Dzindo Jul 18 '18 at 06:59
  • @IanDzindo I see. In these two methods, you need to specify k or percentile. For choosing these hyperparameters, for that reason alone, you should use a separate validation set (see the pipeline sketch after this thread). Furthermore, even if you fix these hyperparameters up front for some reason, doing the selection on the training data will still, as you said in your question, lead to overfitting. – Xiaoxiong Lin Jul 18 '18 at 08:17
  • @IanDzindo You do feature selection to avoid overfitting, i.e. to prevent the model from giving too much weight to features that are only important in the training set. These two methods are not like PCA; they are supervised feature selection methods that essentially choose the features most predictive of the label. If you train and do feature selection on the same dataset, it helps very little. Moreover, I think if your model is something like LDA, it doesn't help at all. – Xiaoxiong Lin Jul 18 '18 at 08:28
  • @XiaoxiongLin Thank you very much for your help, I think I finally understand what's going on. – Ian Dzindo Jul 20 '18 at 06:18
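As a hedged illustration of the point raised in the comments, the following sketch keeps SelectKBest inside a scikit-learn Pipeline and tunes k with cross-validation on the training data only, so the held-out part of each fold never influences which features are chosen. The dataset, model, and candidate values of k are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Nesting the selector in the pipeline means that in each CV fold the
# selector is refit on that fold's training part only.
pipe = Pipeline([
    ("select", SelectKBest(f_classif)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune k by cross-validation on the training data, then evaluate once
# on the held-out test set.
search = GridSearchCV(pipe, {"select__k": [5, 10, 20, 50]}, cv=5)
search.fit(X_train, y_train)

print("best k:", search.best_params_["select__k"])
print("test accuracy:", search.score(X_test, y_test))
```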

0 Answers