
I have a dataset with an event rate of less than 0.3 percent. To improve the modeling results, I did some oversampling using SMOTE.

I initially oversampled so that the event rate increased tenfold, to 3 percent, but that doesn't feel right. Are there any restrictions or heuristics on how much we can oversample?

Are there things I need to consider in deciding how much to oversample?
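For concreteness, here is a minimal sketch of the oversampling step described above, assuming SMOTE from imbalanced-learn; the synthetic data and the exact `sampling_strategy` value are illustrative stand-ins, not the actual pipeline:

```python
# Minimal sketch, assuming imbalanced-learn; the synthetic data and the
# 0.031 target ratio (minority/majority, i.e. ~3% event rate) are illustrative.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for a rare-event dataset (~0.3% positives).
X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.997, 0.003], random_state=0
)

# sampling_strategy as a float is the desired minority/majority ratio after
# resampling; 0.03 / 0.97 is roughly 0.031, i.e. about a 3% event rate.
smote = SMOTE(sampling_strategy=0.031, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(f"event rate after SMOTE: {y_res.mean():.3f}")
```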

Clock Slave
  • There's no general rule because it depends on the specific dataset. You can try to perform cross-validation using increasing percentages and check whether there is a jump in performance at some point. Then you can investigate what happens around that range (a sketch of this sweep follows these comments). – ping Jul 17 '20 at 19:00
  • Focus on an appropriate metric first. And consider rebalancing the sample (if indeed needed). – usεr11852 Jul 17 '20 at 19:01
  • @usεr11852 I am using F1 scores to evaluate the model. After oversampling, I get very different F1 scores on train and test, but consistent recall scores on train and test. – Clock Slave Jul 17 '20 at 19:06
  • @ping I'll try that – Clock Slave Jul 17 '20 at 19:07
  • Just to be clear, we must never use our oversampled set for testing. $F_1$ is fine, but do note that it does not account for true negatives at all. Maybe using a scoring rule like AUC-ROC or the Brier score would be more informative. – usεr11852 Jul 17 '20 at 19:08
  • @usεr11852 I have done the oversampling on the training set only; the test set hasn't been modified. I got recalls of 0.89 and 0.82 on train and test, but F1 scores of 0.63 and 0.07 on the same sets. Precision scores were similar to the F1 scores. I want to use the model because it has good recall, but the F1 tells a different story. – Clock Slave Jul 19 '20 at 08:14
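Following the cross-validation suggestion in the comments, a minimal sketch of such a ratio sweep, assuming imbalanced-learn and scikit-learn; the synthetic data, the ratio grid, and the logistic-regression classifier are illustrative assumptions, not from the original post. The `imblearn` `Pipeline` applies SMOTE only to the training folds, so the validation folds stay untouched, as advised above:

```python
# Sketch of the ratio sweep suggested in the comments. The synthetic data,
# the grid of target ratios, and the classifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # resamples training folds only

# Synthetic stand-in for a rare-event dataset (~0.3% positives).
X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.997, 0.003], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for ratio in [0.01, 0.03, 0.1, 0.3, 1.0]:  # minority/majority after SMOTE
    pipe = Pipeline([
        ("smote", SMOTE(sampling_strategy=ratio, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Scores are computed on the untouched validation folds; look for a jump
    # or plateau across ratios. The Brier score covers calibration, which
    # F1 and recall ignore.
    res = cross_validate(pipe, X, y, cv=cv,
                         scoring=("f1", "recall", "neg_brier_score"))
    print(f"ratio={ratio:.2f}  "
          f"F1={res['test_f1'].mean():.3f}  "
          f"recall={res['test_recall'].mean():.3f}  "
          f"Brier={-res['test_neg_brier_score'].mean():.4f}")
```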

0 Answers