In a binary classification task, I am using imbalanced-learn's implementation of SMOTENC to oversample the positive class of a very imbalanced dataset. The total number of examples is very high, so the oversampling takes quite a while.
I would now like to perform gridsearch and do k-fold cross validation with stratified sampling for each set of parameters. I am aware that the proper way is to perform the oversampling on the training set of each fold (as imbalanced-learn's pipeline class does).
Due to the large computation time for oversampling, however, I would like to move that step outside the CV, so that I need to resample only once for all folds. Is there a sensible way to do that, preferably with tools provided by sklearn or other compatible Python libraries?
As for theoretical soundness, I thought it would be reasonable to do the following:
- Resample the entire dataset, but keep synthetic examples separate.
- During CV, for each fold, do stratified sampling on the non-resampled data.
- Add a likewise stratified subset of the synthetic data to the training set of the fold.
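Concretely, the scheme above would look something like this sketch (function name and data layout are hypothetical; with SMOTENC the synthetic rows are all minority class, so the stratification of the synthetic subset is trivial):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def cv_with_precomputed_synthetic(X_real, y_real, X_syn, y_syn,
                                  n_splits=5, seed=0):
    """For each fold: stratified split of the real data, plus a stratified
    fraction of the pre-computed synthetic examples added to the train split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    frac = (n_splits - 1) / n_splits  # match the train split's share of the data
    for train_idx, test_idx in skf.split(X_real, y_real):
        # Sample a stratified subset of the synthetic data for this fold's
        # train set (trivial stratification if y_syn is a single class).
        X_syn_tr, _, y_syn_tr, _ = train_test_split(
            X_syn, y_syn, train_size=frac, stratify=y_syn, random_state=seed)
        X_train = np.vstack([X_real[train_idx], X_syn_tr])
        y_train = np.concatenate([y_real[train_idx], y_syn_tr])
        yield X_train, y_train, X_real[test_idx], y_real[test_idx]
```

The test split contains only real examples, but the synthetic rows in the train split were generated from the full dataset, which is where my leakage worry below comes from.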
But now I see that this leads to data leakage between the sets, since some information from each fold's test set is bound to be contained in the synthetic examples placed in its training set. And since synthetic examples cannot be traced back to the real examples they were generated from, this can probably not be remedied.
Thanks!