
I’m dealing with a highly unbalanced dataset where 20% of the data belongs to class A and 80% to class B.

It’s very hard for us to produce synthetic class A data.

Just wondering if the below approach is a sensible thing to do:

Total data points: 100

Class A : 20

Class B : 80

How about splitting the dataset into 4 separate samples, each consisting of the 20 A’s and a different 20 B’s? In other words, I’d mix the same 20 A’s with four disjoint samples of 20 B’s, train 4 models (say, random forests) on them, and take the final decision by majority vote of these 4 models.
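The scheme described above is essentially a hand-rolled "undersampling ensemble" (balanced bagging). A minimal sketch, using scikit-learn random forests and a synthetic stand-in for the 100-point dataset (the feature matrix and labels here are illustrative, not the asker's actual data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy stand-in for the 100-point dataset: 20 A's (label 1), 80 B's (label 0)
X = rng.normal(size=(100, 5))
y = np.array([1] * 20 + [0] * 80)

X_a, y_a = X[y == 1], y[y == 1]               # all 20 class-A points
b_idx = rng.permutation(np.where(y == 0)[0])  # shuffled indices of the 80 B's

# Train one balanced model per disjoint group of 20 B's
models = []
for chunk in np.array_split(b_idx, 4):
    X_train = np.vstack([X_a, X[chunk]])
    y_train = np.concatenate([y_a, y[chunk]])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    models.append(clf.fit(X_train, y_train))

def predict_majority(X_new):
    """Majority vote over the four balanced models (a 2-2 tie goes to class A)."""
    votes = np.stack([m.predict(X_new) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Note that every model reuses the same 20 A’s, so the ensemble members are correlated on class A; averaging predicted probabilities instead of hard votes is a common variant of the same idea.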

kjetil b halvorsen
user1189332
  • Also, highly imbalanced is far from 80-20. While there is no strong consensus in the literature I have seen, "highly imbalanced" has almost always been reserved for at least 95-05 cases and more commonly for 99-01. – usεr11852 Sep 29 '18 at 14:48
  • Some other possible dup targets: https://stats.stackexchange.com/questions/235808/binary-classification-with-strongly-unbalanced-classes, https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression, https://stats.stackexchange.com/questions/147021/random-forests-overfitting-unbalanced-classes, https://stats.stackexchange.com/questions/17225/when-over-under-sampling-unbalanced-classes-does-maximizing-accuracy-differ-fro, https://stats.stackexchange.com/questions/227088/when-should-i-balance-classes-in-a-training-data-set – kjetil b halvorsen Sep 29 '18 at 14:49
  • Somehow I found the duplicated articles very helpful, which makes this post helpful as well. – Jinhua Wang Jan 27 '19 at 17:38

0 Answers