
To address imbalanced data, I used an oversampling strategy based on the ROSE algorithm in Python. As you may know, ROSE is a smoothed bootstrapping method, and we can control the dispersion of the augmented data.

I am wondering: is there any rule of thumb for the dispersion (the shrinkage parameter in the Python imbalanced-learn (imblearn) package)? We would like to publish a paper on this, so I would probably need to justify the dispersion of the augmented data.
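
For reference, here is a minimal sketch of where this parameter sits in imblearn's RandomOverSampler; the toy dataset and the shrinkage value are placeholders for illustration, not the values from my study:

```python
# Sketch only: toy data and shrinkage value are illustrative assumptions.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Toy imbalanced dataset (~95% majority / ~5% minority), purely for illustration.
X, y = make_classification(
    n_samples=1_000,
    weights=[0.95, 0.05],
    random_state=0,
)

# shrinkage=None -> plain duplication of minority rows (ordinary bootstrap).
# shrinkage=1.0  -> smoothed bootstrap: synthetic points are drawn around the
#                   originals, with larger shrinkage giving more dispersion.
ros = RandomOverSampler(shrinkage=1.0, random_state=0)
X_res, y_res = ros.fit_resample(X, y)

print("before:", Counter(y))
print("after: ", Counter(y_res))
```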

  • Why is imbalanced data a problem for your work? Statisticians tend not to consider it a problem. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Nov 15 '21 at 02:11
  • So you're saying that oversampling and undersampling are useless/misleading and one should use proper metrics instead of those methods? If so, is there any published paper that I could reference for this? – Darren Christopher Nov 15 '21 at 02:34

0 Answers