I have a binary classification data set with a very high class imbalance: 15126 examples in the majority class versus 8 in the minority class. I've tackled a few data sets with very high imbalance before, but the frequency of the minority class is so low here. Can someone give me a hint for a direction?

kjetil b halvorsen
Mike

1 Answer

Extreme class imbalance makes a classification task much more difficult. I have dealt with this in my own work on predicting suicide attempts and classifying rare facial expressions. There is no easy cure-all (yet, at least), but a lot of smart people are working on the issue and have developed techniques that can help. Depending on the classification algorithm you're using, you might consider a cost-based approach, which penalizes errors on the minority class more strongly than errors on the majority class, or a resampling-based approach, which artificially creates an evenly distributed dataset by oversampling the minority class or undersampling the majority class. See the following paper for more details and ideas:

"Learning from Imbalanced Data" by He and Garcia (IEEE Transactions on Knowledge and Data Engineering, 2009).

http://ieeexplore.ieee.org/abstract/document/5128907/?reload=true
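
To make the two ideas above a bit more concrete, here is a minimal sketch in plain Python, using the class counts from the question. The function names (`balanced_class_weights`, `random_oversample`) and the placeholder feature rows are made up for illustration. The weighting rule shown (total samples divided by number of classes times class count) is one common heuristic, not the only option, and random oversampling is just the simplest resampling scheme.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Cost-based idea: weight each class inversely to its frequency,
    using n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Resampling idea: draw minority rows with replacement until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, c in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - c)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Class counts from the question: 15126 majority (0) vs 8 minority (1).
labels = [0] * 15126 + [1] * 8
rows = [[float(i)] for i in range(len(labels))]  # placeholder features (assumption)

weights = balanced_class_weights(labels)
res_rows, res_labels = random_oversample(rows, labels)

print(weights)              # per-class misclassification weights
print(Counter(res_labels))  # class counts after oversampling
```

Keep in mind that with only 8 minority examples, oversampling just repeats those same 8 points many times, so it cannot add new information. If you resample, do it inside each training fold only, so that copies of a minority example never end up in both the training and the evaluation data.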

Jeffrey Girard