I have a classifier, and I am using leave-one-out cross-validation to assess its performance.
On each iteration, I divide the dataset into training and testing sets. The testing set is just the single held-out subject that I am going to evaluate (the one being left out).
Then, I divide the training set into folds and do feature selection as follows:
I run my filter feature-selection algorithm on every fold. Once that is done, I use a voting scheme over the folds to obtain the final set of features, keeping the ones that were selected across folds.
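To make the setup concrete, here is a minimal sketch of one reading of this procedure in Python with scikit-learn: an outer leave-one-out loop, an inner split of the training set into folds where a univariate filter (ANOVA F-score via `f_classif`) is run on each fold's training portion, and a majority vote across folds to pick the final feature set. The classifier (`SVC`), the number of features kept per fold (`k_per_fold`), and the voting threshold (`vote_threshold`) are all illustrative assumptions, not details from the question.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

# Toy data with the stated shape: 30 subjects, 960 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 960))
y = rng.integers(0, 2, size=30)

k_per_fold = 50      # features kept by the filter in each inner fold (assumed)
vote_threshold = 3   # keep a feature selected in >= this many folds (assumed)

loo = LeaveOneOut()
predictions = np.empty(len(y))

for train_idx, test_idx in loo.split(X):
    X_train, y_train = X[train_idx], y[train_idx]

    # Inner folds: run the filter on each fold's training part, record votes.
    votes = np.zeros(X.shape[1], dtype=int)
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    for inner_train_idx, _ in inner.split(X_train, y_train):
        selector = SelectKBest(f_classif, k=k_per_fold)
        selector.fit(X_train[inner_train_idx], y_train[inner_train_idx])
        votes += selector.get_support().astype(int)

    # Voting: the final feature set for this outer iteration.
    selected = votes >= vote_threshold

    # Fit the classifier on the selected features, predict the held-out subject.
    clf = SVC().fit(X_train[:, selected], y_train)
    predictions[test_idx] = clf.predict(X[test_idx][:, selected])

print(f"LOOCV accuracy: {np.mean(predictions == y):.3f}")
```

Note that in this sketch the feature votes are recomputed inside every outer iteration, so the held-out subject never influences which features are chosen for its own prediction.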
I understand that this procedure is appropriate when you have a small sample, as in my case (30 subjects, 960 features).
My question is: why, if at all, would it be a bad idea to do feature selection on the whole training set instead of dividing it into folds?