I'm currently trying to train a classifier on very few data points (41 instances, 3 classes, supervised). The dataset is peculiar, so I also have to do a lot of feature engineering.
In order to evaluate my features (I have 32, but some of them may be redundant), to find the best feature subset, and to verify that the few data points I have are enough to train a decent classifier, I run a feature-selection step before training a naive Bayes model (along with logistic regression, it is the only classifier that works well for me here). What bothers me is this (I'm using Weka):
- When I use a wrapper around Naive Bayes with the BestFirst search method, the best feature set contains features 1, 4, 6, 7, 9, 13, 18 and 29. With these features I get 82% accuracy under 10-fold cross-validation (my setup is sketched in the first code block after this list).
- When I use filters such as Correlation, GainRatio or InfoGain and rank the features, the wrapper-selected features do not get especially high ranks. When I instead keep the top-ranked features, accuracy drops to 60-70% at best.
- My main concern is that features 4 and 6 have a score of 0 with GainRatio and InfoGain. To me, this means they bring absolutely no information to the classification problem and were chosen by the wrapper only because they happened to work well in this particular training context, even though they are irrelevant.
- To check that, I added 20 random variables to the original feature set and ran the same feature selection (the probe-adding step is sketched in the second code block below). The wrapper selected 5 features, 2 of which were random variables, while GainRatio and InfoGain gave a score of 0 to every random variable. That seems to confirm my suspicion. However, this new set gives me poor accuracy (75%) compared to the set the wrapper selected before the random variables were added, which probably means this method does not test every combination (tell me if I'm wrong, and how to reduce the risk of overfitting).
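For reference, here is roughly how I run both selections through the Weka Java API. This is a minimal sketch rather than my exact code: the file name mydata.arff, the default evaluator settings, and the assumption that the class is the last attribute are placeholders.

```java
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectionComparison {
    public static void main(String[] args) throws Exception {
        // Placeholder file name; 41 instances, 32 features, class last.
        Instances data = new DataSource("mydata.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Wrapper: evaluate feature subsets by cross-validating Naive Bayes,
        // searched with BestFirst (this is what picked 1,4,6,7,9,13,18,29).
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new NaiveBayes());
        AttributeSelection wrapperSel = new AttributeSelection();
        wrapperSel.setEvaluator(wrapper);
        wrapperSel.setSearch(new BestFirst());
        wrapperSel.SelectAttributes(data);
        System.out.println(wrapperSel.toResultsString());

        // Filters: score each feature individually and rank them.
        for (ASEvaluation eval : new ASEvaluation[] {
                new InfoGainAttributeEval(), new GainRatioAttributeEval() }) {
            AttributeSelection ranked = new AttributeSelection();
            ranked.setEvaluator(eval);
            ranked.setSearch(new Ranker());
            ranked.SelectAttributes(data);
            System.out.println(ranked.toResultsString());
        }
    }
}
```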
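And this is how I add the random probe variables before rerunning the selection (again a sketch; the Gaussian values, the seed and the attribute names are arbitrary choices of mine):

```java
import java.util.Random;

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AddRandomProbes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("mydata.arff").getDataSet(); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        Random rng = new Random(42); // arbitrary seed
        for (int p = 0; p < 20; p++) {
            // Insert each probe just before the class attribute,
            // then keep the class as the last attribute.
            int pos = data.classIndex();
            data.insertAttributeAt(new Attribute("random" + p), pos);
            data.setClassIndex(data.numAttributes() - 1);
            for (int i = 0; i < data.numInstances(); i++) {
                data.instance(i).setValue(pos, rng.nextGaussian());
            }
        }
        // 'data' now holds 32 real features plus 20 random probes;
        // rerun the same attribute selection on it.
    }
}
```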
Is my model overfitting, or is the 82% accuracy a reliable score?
Should I delete features 4 and 6 because of their zero InfoGain/GainRatio scores, or are these two metrics not fully reliable? How can I be certain either way?
If it is overfitting, could you recommend a methodology for doing feature selection without this risk of overfitting?
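From what I have read, I suspect the answer involves redoing the selection inside each cross-validation fold instead of selecting once on all 41 points. If so, is the following the right idea? (A sketch using Weka's AttributeSelectedClassifier; the file name and seed are placeholders.)

```java
import java.util.Random;

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NestedSelectionCV {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("mydata.arff").getDataSet(); // placeholder
        data.setClassIndex(data.numAttributes() - 1);

        // The wrapper selection runs inside each training fold, so the
        // outer 10-fold estimate is not biased by having already seen
        // the test fold during selection.
        WrapperSubsetEval wrapper = new WrapperSubsetEval();
        wrapper.setClassifier(new NaiveBayes());
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setEvaluator(wrapper);
        asc.setSearch(new BestFirst());
        asc.setClassifier(new NaiveBayes());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(asc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```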
Thanks, have a good day.