Feature selection should be done before CV. If you select features inside the CV loop, the selected features will change with each fold's training data -- there are techniques that exploit this, but at the beginner level you should start by selecting features before CV.
Splitting the data into a single fixed portion for training and a fixed portion for testing is also inefficient, since each object is then used for only one of the two.
Instead, do this:
Select the features that best predict class membership (or best predict the function) using the entire dataset. Note that I always like to use a separate feature $filtering$ method to identify informative features prior to, and separately from, classification, in order to minimize selection bias. Recursive feature selection, or $wrapping$, uses the classifier itself to select features and commonly carries a greater risk of selection bias, so filtering is the less biased choice. Separating the feature-selection filtration from the classification step is very beneficial when generalizing results to future data not used for training/testing, so I always keep the two ``far removed'' from one another (that is, I don't want the classifier to select any features). Use, for example, statistical hypothesis tests (t-test, Mann-Whitney test, F-test, Kruskal-Wallis test), information gain (entropy), or the Gini index for feature filtration (selection); a sketch follows.
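As an illustration, here is a minimal Python sketch of such a univariate filter, assuming a two-class problem with a data matrix `X` (objects in rows, features in columns) and integer labels `y`; the names `filter_features` and `n_keep` are my own, and the t-test could be swapped for any of the tests listed above.

```python
# Feature filtration with a univariate hypothesis test, applied once to the
# entire dataset and kept entirely separate from the classifier.
import numpy as np
from scipy.stats import ttest_ind

def filter_features(X, y, n_keep=20):
    """Rank features by two-sample t-test p-value and keep the n_keep best.

    X : (n_objects, n_features) data matrix
    y : (n_objects,) binary class labels (0/1)
    """
    p_values = np.array([
        ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
        for j in range(X.shape[1])
    ])
    # Smallest p-values = most informative features under this filter.
    return np.argsort(p_values)[:n_keep]
```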
Divide the objects uniformly into ten folds $\mathcal{D}_1, \mathcal{D}_2,\ldots,\mathcal{D}_{10}$.
First, train with objects in the 9 folds $\mathcal{D}_2, \mathcal{D}_3,\ldots,\mathcal{D}_{10}$ and test the trained system on objects in fold $\mathcal{D}_1$.
Next, train with objects in the 9 folds $\mathcal{D}_1, \mathcal{D}_3,\ldots,\mathcal{D}_{10}$ and test the trained system on objects in fold $\mathcal{D}_2$.
Repeat the above until objects in fold $\mathcal{D}_{10}$ are tested, with the 9 folds $\mathcal{D}_1, \mathcal{D}_2,\ldots,\mathcal{D}_{9}$ used for training; a sketch of this loop follows.
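A minimal sketch of one such 10-fold pass, assuming a classifier object `clf` with `fit`/`predict` methods (scikit-learn style); the helper names are my own.

```python
import numpy as np

def ten_fold_partition(n_objects, rng):
    """Uniformly assign the object indices to ten folds D_1, ..., D_10."""
    return np.array_split(rng.permutation(n_objects), 10)

def run_10fold_cv(X, y, clf, folds):
    """Train on 9 folds and test the trained system on the held-out fold,
    cycling so that every fold serves as the test fold exactly once."""
    y_pred = np.empty_like(y)
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        clf.fit(X[train_idx], y[train_idx])
        y_pred[test_idx] = clf.predict(X[test_idx])
    return y_pred
```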
For each object in each test fold, increment the confusion matrix $\mathbf{C}$ (with dimensions $\Omega \times \Omega$) by one in element $c_{\omega\hat{\omega}}$, where $\omega$ is the true class of the object and $\hat{\omega}$ is the predicted class.
After each 10-fold CV, total accuracy for classification is then the sum of the diagonal elements of $\mathbf{C}$ divided by the total number of objects, i.e., $Acc=\sum_{\omega=1}^{\Omega} c_{\omega\omega}/n$.
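In code, this bookkeeping might look like the following sketch, assuming integer class labels $0,\ldots,\Omega-1$ (function name mine):

```python
import numpy as np

def confusion_and_accuracy(y_true, y_pred, n_classes):
    """Accumulate the Omega x Omega confusion matrix C and compute
    total accuracy as the diagonal sum divided by n."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for omega, omega_hat in zip(y_true, y_pred):
        C[omega, omega_hat] += 1          # increment c_{omega, omega-hat}
    acc = np.trace(C) / len(y_true)       # sum of diagonal elements / n
    return C, acc
```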
Note that the above procedure is called a 10-fold CV. You should next $repartition$ the objects into 10 folds, this time after randomly shuffling (permuting) the order of all objects, and then repeat the above 10-fold CV. This ensures that the objects assigned to each fold differ between repetitions. Repartition ten times, each time performing a 10-fold CV, then calculate total accuracy. This is called a ``ten 10-fold CV''; a sketch follows.
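Putting the pieces together, a ten 10-fold CV could be sketched as below, reusing the helpers from the previous snippets and accumulating one confusion matrix across all ten repartitions (the wrapper name and the seed are mine):

```python
import numpy as np

def ten_times_10fold_cv(X, y, clf, n_classes, seed=0):
    """Repartition ten times, each with a fresh shuffle, running a full
    10-fold CV per partition and summing the confusion matrices."""
    rng = np.random.default_rng(seed)
    C_total = np.zeros((n_classes, n_classes), dtype=int)
    for _ in range(10):
        folds = ten_fold_partition(len(y), rng)   # new shuffled partition
        y_pred = run_10fold_cv(X, y, clf, folds)
        C_pass, _ = confusion_and_accuracy(y, y_pred, n_classes)
        C_total += C_pass
    return C_total
```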
Once you have performed a ten 10-fold CV, you can select features again after transforming their values, for example by mean-zero standardizing, normalizing into the range $[0,1]$, or fuzzifying. The key point is that, on average, classification accuracy will change with the features used. First get a handle on classification accuracy using the initial group of features; then, whenever you select features a different way (perhaps after transforming their values), run a complete ten 10-fold CV for the changed feature set.
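For concreteness, two of the transforms mentioned could be sketched as follows (fuzzification, which maps feature values to fuzzy-set membership grades, is omitted because its form is problem-specific):

```python
import numpy as np

def standardize(X):
    """Mean-zero standardization: subtract each feature's mean and
    divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize_01(X):
    """Rescale each feature into the range [0, 1]
    (assumes no feature is constant)."""
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)
```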
For accuracy determination following ten 10-fold CV, use $Acc=\sum_{\omega=1}^{\Omega} c_{\omega\omega} \big/ \sum_{\omega=1}^{\Omega} \sum_{\hat{\omega}=1}^{\Omega} c_{\omega\hat{\omega}}$, which is equal to the sum of the diagonal elements of $\mathbf{C}$ divided by the sum of all elements of $\mathbf{C}$.
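Applied to the accumulated matrix from the sketch above, this is simply:

```python
import numpy as np

def total_accuracy(C):
    """Diagonal sum of C divided by the sum of all elements of C."""
    return np.trace(C) / C.sum()
```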