I have a dataset that I intend to use for binary classification. However, it is very unbalanced because of the nature of the data itself (the positives are quite rare). The negatives are 99.8% and the positives are 0.02%. I have approximately 60 variables in my dataset.

I would like to run a feature importance test to eliminate less useful features. However, I cannot directly run methods to do that, due to the unbalanced nature of the dataset. How do I approach this problem? P.S. I plan to use Gradient Boosting and/or Neural Networks on this dataset.

1 Answer

If you are planning to use Gradient Boosting and/or Neural Networks, then you do not need to worry about feature selection, because the model will do it for you.
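As a rough illustration of that point, here is a minimal sketch, assuming scikit-learn is available. The synthetic data, the 0.2% positive rate, and the single informative feature are all made up for the example. It fits a gradient boosting model directly on the unbalanced data and reads off the impurity-based importances:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical toy data: 20,000 rows, 5 features, 40 positives (0.2%).
n_samples, n_positives = 20_000, 40
X = rng.normal(size=(n_samples, 5))
y = np.zeros(n_samples, dtype=int)
y[:n_positives] = 1

# Assumption for the example: only feature 0 carries any signal.
X[:n_positives, 0] += 3.0

# Fit on the unbalanced data as-is, with no resampling.
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
model.fit(X, y)

# Impurity-based importances, one value per feature.
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first

for idx in ranking:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```

In this toy setup the informative feature should come out on top even though the positives are rare. Real data will be noisier, so treat the ranking as a guide rather than a verdict.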

On the other hand, I do not understand why you say:

"I cannot directly run methods to do that, due to the unbalanced nature of the dataset."

The following plot is from a toy example on feature selection. Suppose we use PCA for feature selection (which I do not recommend, because PCA ignores the class label and considers only the variance of the features). Whether the data is balanced or unbalanced, the algorithm will highlight the same important feature (by variance).

[Plot: toy feature-selection example (image not available)]

Unbalanced data does not make much difference to feature selection.
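The PCA toy example can be reproduced roughly as follows. This is a sketch under my own assumptions, not the code behind the original plot: the two-feature Gaussian data, the class sizes, and the helper names `first_pc` and `make_data` are all hypothetical. It compares the first principal component on a balanced and on an unbalanced sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def first_pc(X):
    """Return the unit-length first principal axis of X (rows are samples)."""
    X_centered = X - X.mean(axis=0)
    # The right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    pc = vt[0]
    # Fix the sign so the largest loading is positive and runs are comparable.
    return pc if pc[np.argmax(np.abs(pc))] > 0 else -pc

def make_data(n_neg, n_pos):
    """Hypothetical toy data: feature 0 has high variance, feature 1 low variance."""
    neg = rng.normal(loc=[0.0, 0.0], scale=[5.0, 1.0], size=(n_neg, 2))
    pos = rng.normal(loc=[1.0, 1.0], scale=[5.0, 1.0], size=(n_pos, 2))
    # The class labels are dropped on purpose: PCA never sees them.
    return np.vstack([neg, pos])

pc_balanced = first_pc(make_data(5000, 5000))   # 50% positives
pc_unbalanced = first_pc(make_data(9980, 20))   # 0.2% positives

print("first PC, balanced:  ", np.round(pc_balanced, 3))
print("first PC, unbalanced:", np.round(pc_unbalanced, 3))
```

In both cases the first component loads almost entirely on the high-variance feature, which is the sense in which the class ratio makes little difference here.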

Haitao Du
  • If I apply, say, SMOTE to balance the dataset (which I need to), then the selection may be biased. That is why I was looking for techniques that can be applied to imbalanced datasets. – sinha-shaurya Jul 15 '21 at 06:00
  • @sinha-shaurya Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jul 15 '21 at 09:51