I'm working on a project and I would like to know if the following strategy is good/correct.
The input is a dataset with 2,500 features and 1,000 instances, and I have to apply feature selection to it. First I randomly split the data into a learning set (70% of the original data) and a test set (the remaining 30%).
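In code, the split would look roughly like this (a sketch using scikit-learn; the random arrays are just placeholders for my real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(1000, 2500)          # placeholder for my 1,000 x 2,500 feature matrix
y = rng.randint(0, 2, size=1000)  # placeholder class labels

# 70% learning set, 30% held-out test set
X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
```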
Now I split the learning set into 10 (more or less equal) folds for cross-validation. In each iteration I take 9 of the 10 folds and do a feature ranking on them (using chi-squared), keep the 10 highest-ranked features, and evaluate a model (an SVM) trained on those features against the remaining fold. I repeat this 10 times, so each fold is used once as the validation part.
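A rough sketch of that inner loop, continuing from the split above (again scikit-learn; note chi-squared needs non-negative feature values, which my placeholder data happens to satisfy):

```python
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_rankings, fold_accuracies = [], []

for train_idx, val_idx in kf.split(X_learn, y_learn):
    X_tr, X_val = X_learn[train_idx], X_learn[val_idx]
    y_tr, y_val = y_learn[train_idx], y_learn[val_idx]

    # rank features on the 9 training folds only
    selector = SelectKBest(chi2, k=10).fit(X_tr, y_tr)
    top10 = selector.get_support(indices=True)

    # evaluate an SVM restricted to those 10 features on the held-out fold
    clf = SVC(kernel="linear").fit(X_tr[:, top10], y_tr)
    acc = accuracy_score(y_val, clf.predict(X_val[:, top10]))

    fold_rankings.append(selector.scores_)  # keep full chi2 scores for later aggregation
    fold_accuracies.append(acc)
```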
These 10 iterations give me 10 feature rankings and 10 model evaluations/accuracies. From these I would pick the best feature ranking (either the one from the fold with the highest accuracy, or one obtained by summing the rankings across folds) and finally test it on the 30% test set I held out at the beginning.
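For the "sum the rankings" variant, the final step would be something like this (continuing the sketches above; the "best single fold" variant would instead take the ranking from the fold with the highest accuracy):

```python
# sum the chi2 scores over the 10 folds and take the overall top 10 features
summed_scores = np.sum(fold_rankings, axis=0)
top10_overall = np.argsort(summed_scores)[-10:]

# refit on the whole 70% learning set, then score once on the 30% test set
final_clf = SVC(kernel="linear").fit(X_learn[:, top10_overall], y_learn)
test_acc = accuracy_score(y_test, final_clf.predict(X_test[:, top10_overall]))
print(f"Accuracy on the held-out 30%: {test_acc:.3f}")
```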
Please give your opinion on whether this is good or really stupid...