I have a classification task (predicting DNA methylation) with a somewhat unbalanced dataset - 38% of values are in the minority class, and the other 62% in the majority class.

I have read that one way to handle unbalanced data is over- or undersampling. What I have not found is the threshold: at what degree of imbalance, i.e. what ratio between the majority and minority class, does it start to make sense to consider over-/undersampling?

I understand that it makes sense at 99%/1% and does not at 50%/50%. But what about 40%/60%, or 20%/80%? Which metrics can I use to decide whether to over-/undersample?

charelf
  • [Class imbalance is less of a problem than you might think](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he), and undersampling/oversampling cannot fix a non-problem. Additional links: [(1)](https://www.fharrell.com/post/class-damage/) [(2)](https://www.fharrell.com/post/classification/) [(3)](https://twitter.com/f2harrell/status/1062424969366462473?lang=en) – Dave Nov 01 '21 at 17:51
  • Link [(4)](https://stats.stackexchange.com/questions/548339/why-class-balancing-techniques-are-sometimes-useful)...to me, it looks like Dikran Marsupial nails it. – Dave Nov 01 '21 at 17:57
  • Thank you, these look helpful. Will take a look and close the question if answered. – charelf Nov 01 '21 at 17:59
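
For anyone who would rather check this empirically than look for a fixed ratio, here is a minimal sketch of one way to do it. It fits the same model with and without random oversampling, then compares the predicted probabilities on an untouched test set using the Brier score, a proper scoring rule. Everything in it is an assumption made for illustration, not something from the question or the linked posts: the data is simulated (one Gaussian feature, 38% prevalence), the model is a hand-rolled one-feature logistic regression, and all function names are hypothetical.

```python
import math
import random


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def simulate(n, rng, prevalence=0.38):
    """Toy data (assumption): class 1 has feature mean 1.0, class 0 has mean 0.0."""
    ys = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    xs = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in ys]
    return xs, ys


def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """One-feature logistic regression fitted by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(xs, w, b):
    """Predicted probability of class 1 for each feature value."""
    return [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]


def oversample_minority(xs, ys, rng):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    pos = [i for i, y in enumerate(ys) if y == 1]
    neg = [i for i, y in enumerate(ys) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = list(range(len(ys))) + extra
    return [xs[i] for i in idx], [ys[i] for i in idx]


def main():
    rng = random.Random(0)
    x_train, y_train = simulate(1000, rng)
    x_test, y_test = simulate(1000, rng)  # never resampled

    w, b = fit_logistic(x_train, y_train)
    x_over, y_over = oversample_minority(x_train, y_train, rng)
    w_o, b_o = fit_logistic(x_over, y_over)

    for name, (wi, bi) in [("original", (w, b)), ("oversampled", (w_o, b_o))]:
        p = predict(x_test, wi, bi)
        print(name,
              "Brier:", round(brier_score(y_test, p), 4),
              "mean predicted p:", round(sum(p) / len(p), 3),
              "observed rate:", round(sum(y_test) / len(y_test), 3))


if __name__ == "__main__":
    main()
```

The thing to look at is whether resampling improves the score on data that keeps the real class ratio, and whether the mean predicted probability still matches the observed rate. Only the training set is resampled here; the test set is left at its original 38%/62% split.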

0 Answers