
I have a dataset with an event rate of less than 0.3 percent. To improve the modeling results, I did some oversampling using SMOTE.

I initially oversampled so that the event rate increased tenfold, to 3 percent, but that doesn't feel right. Are there any restrictions or heuristics on how much we can oversample?

Are there things I need to consider in deciding how much to oversample?
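For concreteness, here is a minimal sketch of the oversampling step described above, assuming SMOTE from imbalanced-learn; the synthetic data and the exact `sampling_strategy` value are illustrative stand-ins, not the actual pipeline:

```python
# Minimal sketch, assuming imbalanced-learn; the synthetic data and the
# 0.031 target ratio (minority/majority, i.e. ~3% event rate) are illustrative.
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic stand-in for a rare-event dataset (~0.3% positives).
X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.997, 0.003], random_state=0
)

# sampling_strategy as a float is the desired minority/majority ratio after
# resampling; 0.03 / 0.97 is roughly 0.031, i.e. about a 3% event rate.
smote = SMOTE(sampling_strategy=0.031, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(f"event rate after SMOTE: {y_res.mean():.3f}")
```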

Clock Slave
  • There's no general rule because it depends on the specific dataset. You can try to perform cross-validation using increasing percentages and check whether there is a jump in performance at some point. Then you can investigate what happens around that range (a sketch of this sweep follows these comments). – ping Jul 17 '20 at 19:00
  • Focus on an appropriate metric first. And consider rebalancing the sample (if indeed needed). – usεr11852 Jul 17 '20 at 19:01
  • @usεr11852 I am using F1 scores to evaluate the model. After oversampling, I get very different F1 scores on train and test, but consistent recall scores on train and test. – Clock Slave Jul 17 '20 at 19:06
  • @ping I'll try that – Clock Slave Jul 17 '20 at 19:07
  • Just to be clear, we must never use our oversampled set for testing. $F_1$ is fine, but do note that it does not account for true negatives at all. Maybe using a scoring rule like AUC-ROC or the Brier score would be more informative. – usεr11852 Jul 17 '20 at 19:08
  • @usεr11852 I have done the oversampling on the training set only; the test set hasn't been modified. I got recalls of 0.89 and 0.82 on train and test, but F1 scores of 0.63 and 0.07 on the same sets. Precision scores were similar to the F1 scores. I want to use the model because it has good recall, but the F1 tells a different story. – Clock Slave Jul 19 '20 at 08:14
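Following the cross-validation suggestion in the comments, a minimal sketch of such a ratio sweep, assuming imbalanced-learn and scikit-learn; the synthetic data, the ratio grid, and the logistic-regression classifier are illustrative assumptions, not from the original post. The `imblearn` `Pipeline` applies SMOTE only to the training folds, so the validation folds stay untouched, as advised above:

```python
# Sketch of the ratio sweep suggested in the comments. The synthetic data,
# the grid of target ratios, and the classifier are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # resamples training folds only

# Synthetic stand-in for a rare-event dataset (~0.3% positives).
X, y = make_classification(
    n_samples=100_000, n_features=20, weights=[0.997, 0.003], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for ratio in [0.01, 0.03, 0.1, 0.3, 1.0]:  # minority/majority after SMOTE
    pipe = Pipeline([
        ("smote", SMOTE(sampling_strategy=ratio, random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # Scores are computed on the untouched validation folds; look for a jump
    # or plateau across ratios. The Brier score covers calibration, which
    # F1 and recall ignore.
    res = cross_validate(pipe, X, y, cv=cv,
                         scoring=("f1", "recall", "neg_brier_score"))
    print(f"ratio={ratio:.2f}  "
          f"F1={res['test_f1'].mean():.3f}  "
          f"recall={res['test_recall'].mean():.3f}  "
          f"Brier={-res['test_neg_brier_score'].mean():.4f}")
```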

0 Answers