Data set with high imbalance and extremely low frequency of the minority class

Question

I have a binary classification data set with a very high imbalance: 15126 samples in the majority class versus 8 in the minority class. I've tackled a few highly imbalanced data sets before, but the frequency of the minority class here is far lower than anything I've seen. Can someone give me a hint for a direction?

This is probably too broad to be answerable here. What do you want to know? Can you make this more concrete than "give me a hint for a direction"? — gung - Reinstate Monica, Sep 12 '16 at 00:54

score 0 · Answer 1 · answered Sep 09 '17 at 20:03

Extreme class imbalance does make a classification task much harder. I have dealt with it in my own work on predicting suicide attempts and classifying rare facial expressions. There's no cure-all (yet, at least), but a lot of smart people are working on the problem and have developed techniques that can help. Depending on the classification algorithm you're using, you might consider a cost-based approach, which penalizes errors on the minority class more heavily than errors on the majority class, or a resampling-based approach, which artificially balances the class distribution by oversampling the minority class or undersampling the majority class. See the following paper for more details and ideas:

"Learning from Imbalanced Data" by He and Garcia.

http://ieeexplore.ieee.org/abstract/document/5128907/?reload=true
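
To make the two ideas in this answer concrete, here is a minimal pure-Python sketch, not a definitive implementation. It assumes the "15126 - 8" in the question means 15126 majority samples and 8 minority samples. The function names `balanced_class_weights` and `random_oversample` are hypothetical helpers made up for illustration, and the feature rows are placeholders. The weighting formula n / (k * count) is the same heuristic scikit-learn uses for `class_weight="balanced"`.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Cost-based idea: weight each class inversely to its frequency.

    Uses n_samples / (n_classes * class_count), so errors on the rare
    class cost proportionally more in a weighted loss.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Resampling idea: duplicate minority rows (with replacement)
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, c in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - c)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Assumed class counts from the question: 15126 negatives, 8 positives.
labels = [0] * 15126 + [1] * 8
rows = [[float(i)] for i in range(len(labels))]  # placeholder features

weights = balanced_class_weights(labels)
res_rows, res_labels = random_oversample(rows, labels)

print(weights)
print(Counter(res_labels))
```

The weights can be passed to any learner that accepts per-class or per-sample weights. With only 8 positives, oversampling merely repeats those same 8 rows, so it should be applied to the training folds only, after the train/test split, or the evaluation will be optimistic.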
