
I have my training data with the following approximate distribution:

  • Negative events: 90,000
  • Positive events: 5,000

Training a model would require oversampling the minority class (and possibly undersampling the majority class), since the classes are heavily imbalanced.

To what extent is it fair to upsample? Let's say I want a 50/50 distribution of events in the training set. Should I oversample the positive events from the current 5,000 to 90,000 observations?

That amounts to upsampling the positive events 18 times! Would that not add unwanted noise? Or should I upsample them to something like 50,000 (10 times) and also downsample the majority class from 90,000 to 50,000?
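To make the arithmetic behind these options explicit, here is a minimal sketch. The helper name `resampling_plan` is purely illustrative (it is not from any library); it only computes the ratios and does no resampling itself.

```python
def resampling_plan(n_pos, n_neg, target_per_class):
    """Return (replication factor for the minority class,
    fraction of the majority class kept) for a balanced target size.

    Hypothetical helper for illustration only.
    """
    up_factor = target_per_class / n_pos
    # Never "keep" more than 100% of the majority class.
    keep_fraction = min(1.0, target_per_class / n_neg)
    return up_factor, keep_fraction


n_pos, n_neg = 5_000, 90_000

# The three balanced targets discussed in the question.
for target in (90_000, 50_000, 10_000):
    up, keep = resampling_plan(n_pos, n_neg, target)
    print(f"target {target:>6} per class: "
          f"positives x{up:g}, negatives kept {keep:.1%}")
```

For the counts above, this gives replication factors of 18, 10 and 2, while keeping all, about 55.6% and about 11.1% of the negatives respectively.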

We could even upsample them to just 10,000 (2 times) and take an equal number of random observations from the majority class, but then there is a risk that this majority-class sample is not representative of the entire population. To guard against that, we would need a stratified sample from the majority class, which might be achieved by forming clusters and sampling from each cluster in proportion to its size. However, that is a separate exercise altogether and a lengthy approach.
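The proportional sampling step could look roughly like the sketch below. It assumes the cluster labels for the majority-class rows already exist (the clustering itself is not shown), and the function name `stratified_sample` is hypothetical. Quotas are rounded with a largest-remainder rule so that they sum exactly to the requested sample size.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(cluster_labels, n_samples, seed=0):
    """Pick row indices so each cluster contributes in proportion to its size.

    cluster_labels[i] is the cluster of majority-class row i; the labels are
    assumed to come from a separate clustering step.
    """
    total = len(cluster_labels)
    if n_samples > total:
        raise ValueError("cannot sample more rows than are available")

    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)

    # Ideal (fractional) quota per cluster, then round down.
    exact = {c: n_samples * len(rows) / total for c, rows in members.items()}
    quota = {c: int(v) for c, v in exact.items()}

    # Hand the remaining slots to the clusters with the largest remainders.
    leftover = n_samples - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    chosen = []
    for c, rows in members.items():
        chosen.extend(rng.sample(rows, quota[c]))  # without replacement
    return sorted(chosen)


# Toy example: three clusters holding 60%, 30% and 10% of the rows.
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
picked = stratified_sample(labels, 10)
print(Counter(labels[i] for i in picked))
```

In the toy example the ten sampled rows split 6, 3 and 1 across the clusters, mirroring their 60/30/10 shares.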

Please advise on the ideal approach for dealing with such scenarios.

  • Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Sep 27 '18 at 08:30
  • @StephanKolassa I went through the links, but they don't exactly answer my question: up to how many times can the minority class be upsampled? – Abhisek Dutta Sep 27 '18 at 08:47
  • Your second approach (the lengthy one) is more appropriate. – user2974951 Sep 27 '18 at 09:09
  • My point is that you should *not* upsample the minority class. Instead, use better approaches that account for the asymmetry in classes. – Stephan Kolassa Sep 27 '18 at 09:14
