I'm currently trying to train a classifier on very few data points (41 instances, 3 classes, supervised). The dataset is peculiar, so I also have to do a lot of feature engineering.
In order to evaluate my features (I have 32, but some of them may be redundant), to find the best feature subset, and to verify that the few data points I have are enough to train a decent classifier, I run a feature-selection step before training a naive Bayes model (along with logistic regression, it is the only classifier that works well for me here). What bothers me is this (I'm using Weka):
- When I use a wrapper around Naive Bayes with the BestFirst search method, the best feature set contains features 1, 4, 6, 7, 9, 13, 18 and 29. With these features I get 82% accuracy under 10-fold cross-validation (my setup is sketched in the first code block after this list).
- When I use filters such as Correlation, GainRatio or InfoGain and rank the features, the wrapper-selected features do not get especially high ranks. When I instead keep the top-ranked features, accuracy drops to 60-70% at best.
- My main concern is that features 4 and 6 have a score of 0 with GainRatio and InfoGain. To me, this means they bring absolutely no information to the classification problem and were chosen by the wrapper only because they happened to work well in this particular training context, even though they are irrelevant.
- To check that, I added 20 random variables to the original feature set and ran the same feature selection (the probe-adding step is sketched in the second code block below). The wrapper selected 5 features, 2 of which were random variables, while GainRatio and InfoGain gave a score of 0 to every random variable. That seems to confirm my suspicion. However, this new set gives me poor accuracy (75%) compared to the set the wrapper selected before the random variables were added, which probably means this method does not test every combination (tell me if I'm wrong, and how to reduce the risk of overfitting).
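For reference, here is roughly how I run both selections through the Weka Java API. This is a minimal sketch rather than my exact code: the file name mydata.arff, the default evaluator settings, and the assumption that the class is the last attribute are placeholders.

```java
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectionComparison {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; 41 instances, 32 features, class last.
        Instances data = new DataSource("mydata.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Wrapper: evaluate feature subsets by cross-validating Naive Bayes,
        // searched with BestFirst (this is what picked 1,4,6,7,9,13,18,29).
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new NaiveBayes());
        AttributeSelection wrapperSel = new AttributeSelection();
        wrapperSel.setEvaluator(wrapper);
        wrapperSel.setSearch(new BestFirst());
        wrapperSel.SelectAttributes(data);
        System.out.println(wrapperSel.toResultsString());

        // Filters: score each feature individually and rank them.
        for (ASEvaluation eval : new ASEvaluation[] {
                new InfoGainAttributeEval(), new GainRatioAttributeEval() }) {
            AttributeSelection ranked = new AttributeSelection();
            ranked.setEvaluator(eval);
            ranked.setSearch(new Ranker());
            ranked.SelectAttributes(data);
            System.out.println(ranked.toResultsString());
        }
    }
}
```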
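And this is how I add the random probe variables before rerunning the selection (again a sketch; the Gaussian values, the seed and the attribute names are arbitrary choices of mine):

```java
import java.util.Random;

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AddRandomProbes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("mydata.arff").getDataSet(); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        Random rng = new Random(42); // arbitrary seed
        for (int p = 0; p < 20; p++) {
            // Insert each probe just before the class attribute,
            // then keep the class as the last attribute.
            int pos = data.classIndex();
            data.insertAttributeAt(new Attribute("random" + p), pos);
            data.setClassIndex(data.numAttributes() - 1);
            for (int i = 0; i < data.numInstances(); i++) {
                data.instance(i).setValue(pos, rng.nextGaussian());
            }
        }
        // 'data' now holds 32 real features plus 20 random probes;
        // rerun the same attribute selection on it.
    }
}
```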
Is my model overfitting, or is the 82% accuracy a reliable score?
Should I delete features 4 and 6 because of their zero InfoGain/GainRatio scores, or are these two metrics not fully reliable? How can I be certain either way?
If it is overfitting, could you recommend a methodology for doing feature selection without this risk of overfitting?
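From what I have read, I suspect the answer involves redoing the selection inside each cross-validation fold instead of selecting once on all 41 points. If so, is the following the right idea? (A sketch using Weka's AttributeSelectedClassifier; the file name and seed are placeholders.)

```java
import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NestedSelectionCV {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("mydata.arff").getDataSet(); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        // The wrapper selection runs inside each training fold, so the
        // outer 10-fold estimate is not biased by having already seen
        // the test fold during selection.
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new NaiveBayes());
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(wrapper);
        asc.setSearch(new BestFirst());
        asc.setClassifier(new NaiveBayes());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```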
Thanks, have a good day.