Model Overfitting When Employing Upsampling/DownSampling

Question

I have a binary classification problem with a large class imbalance ( 1/100 ).

I am getting fair results using ensemble modeling.

I understand that one technique that could improve results is upsampling the minority class and / or downsampling the majority class.

I have found that when I upsample the minority class (only in the training set of course) to a ratio 1/10 my cross validation results improve substantially and my performance of the model fit on the entire training set on unseen data improves relative to cross validation and unseen performance prior to up sampling the training data.

Here's the question:

When we upsample the training set we effectively alter its composition to bias the model with hopes getting better performance on unseen data. However, cross validation performance on the (altered ) training set is no longer reflective of expected performance on unseen data.

How do you assess model overfitting in this situation since comparison of cross val performance on training no longer reflects/ approximates fully trained model performance on unseen data?

[Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) — Stephan Kolassa, Mar 23 '21 at 05:50

score 2 · Answer 1 · answered Mar 26 '21 at 12:02

I think you should apply undersampling/oversampling separatedly on each training fold of the CV, leaving the validation fold intact, so that the metric you get on the validation set is valid and closer that you would have on unseen data.

I recommend you use the imblearn pipeline object, because it is designed that way and you don't have to take care of it by yourself.

Otherwise, you can take a look at this library that I have made for myself, and it's now pip installable, that performs nested cross validation with a lot of options and features:

https://github.com/JaimeArboleda/nestedcvtraining

Model Overfitting When Employing Upsampling/DownSampling

1 Answers1