I have a binary classification data set with a very high class imbalance: 15126 examples in one class versus 8 in the other. I've tackled a few highly imbalanced data sets before, but the frequency of the minority class is extremely low here. Can someone suggest a direction?
- This is probably too broad to be answerable here. What do you want to know? Can you make this more concrete than "give me a hint for a direction"? – gung - Reinstate Monica Sep 12 '16 at 00:54
- You have only 8 examples from the minority class? – bean Aug 22 '19 at 20:33
- Maybe look into anomaly detection ... – kjetil b halvorsen Aug 22 '19 at 20:57
1 Answer
Extreme class imbalance does make a classification task much more difficult. I have dealt with this in my own work on predicting suicide attempts and classifying rare facial expressions. There is no easy cure-all (yet, at least), but a lot of smart people are working on the issue and have developed techniques that can help. Depending on the classification algorithm you're using, you might consider a cost-based approach, which penalizes errors on the minority class more strongly than errors on the majority class, or a resampling-based approach, which artificially creates a more evenly distributed dataset by oversampling the minority class or undersampling the majority class. See the following paper for more details and ideas:
"Learning from Imbalanced Data" by He and Garcia.
http://ieeexplore.ieee.org/abstract/document/5128907/?reload=true
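
To make the two ideas concrete, here is a minimal sketch in plain Python using the class counts from the question. The helper names `balanced_class_weights` and `random_oversample` are made up for illustration, the features are placeholders, and the weighting rule (`n_samples / (n_classes * class_count)`) is one common heuristic, not the only option.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Cost-based idea: weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Resampling idea: duplicate randomly chosen rows of the smaller
    classes until every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - cnt)  # empty for the largest class
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data with the class counts from the question: 15126 negatives, 8 positives.
labels = [0] * 15126 + [1] * 8
rows = [[float(i)] for i in range(len(labels))]  # placeholder features (assumption)

weights = balanced_class_weights(labels)
bal_rows, bal_labels = random_oversample(rows, labels)

print(weights)              # per-class misclassification weights
print(Counter(bal_labels))  # class counts after oversampling
```

Many libraries accept per-class weights like these directly (for example, the `class_weight` parameter on a number of scikit-learn classifiers). With only 8 minority examples, oversampling can only duplicate those same 8 rows, so it is probably safest to resample inside the training folds only and to evaluate on untouched data.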
