
I’m dealing with a highly unbalanced dataset where 20% of the data belongs to class A and 80% to class B.

It’s very hard for us to produce synthetic class A data.

Just wondering if the below approach is a sensible thing to do:

Total data points: 100

Class A : 20

Class B : 80

How about splitting the dataset into 4 separate samples, each consisting of the 20 A’s and a different 20 B’s? In other words, I’d mix the same 20 A’s with four disjoint samples of 20 B’s, train 4 models (say, random forests) on them, and take the final decision by majority vote of these 4 models.
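The scheme described above is essentially a hand-rolled "undersampling ensemble" (balanced bagging). A minimal sketch, using scikit-learn random forests and a synthetic stand-in for the 100-point dataset (the feature matrix and labels here are illustrative, not the asker's actual data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the 100-point dataset: 20 A's (label 1), 80 B's (label 0)
X = rng.normal(size=(100, 5))
y = np.array([1] * 20 + [0] * 80)

X_a, y_a = X[y == 1], y[y == 1]               # all 20 class-A points
b_idx = rng.permutation(np.where(y == 0)[0])  # shuffled indices of the 80 B's

# Train one balanced model per disjoint group of 20 B's
models = []
for chunk in np.array_split(b_idx, 4):
    X_train = np.vstack([X_a, X[chunk]])
    y_train = np.concatenate([y_a, y[chunk]])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    models.append(clf.fit(X_train, y_train))

def predict_majority(X_new):
    """Majority vote over the four balanced models (a 2-2 tie goes to class A)."""
    votes = np.stack([m.predict(X_new) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Note that every model reuses the same 20 A’s, so the ensemble members are correlated on class A; averaging predicted probabilities instead of hard votes is a common variant of the same idea.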

kjetil b halvorsen
user1189332
  • Also, highly imbalanced is far from 80-20. While there is no strong consensus in the literature I have seen, "highly imbalanced" has almost always been reserved for at least 95-05 cases and more commonly for 99-01. – usεr11852 Sep 29 '18 at 14:48
  • Some other possible dup targets: https://stats.stackexchange.com/questions/235808/binary-classification-with-strongly-unbalanced-classes, https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression, https://stats.stackexchange.com/questions/147021/random-forests-overfitting-unbalanced-classes, https://stats.stackexchange.com/questions/17225/when-over-under-sampling-unbalanced-classes-does-maximizing-accuracy-differ-fro, https://stats.stackexchange.com/questions/227088/when-should-i-balance-classes-in-a-training-data-set – kjetil b halvorsen Sep 29 '18 at 14:49
  • Somehow I found the duplicated articles very helpful, which makes this post helpful as well. – Jinhua Wang Jan 27 '19 at 17:38

0 Answers