In a binary classification task, I am using imbalanced-learn's implementation of SMOTENC to oversample the positive class of a very imbalanced dataset. The total number of examples is very high, so the oversampling takes quite a while.
I would now like to perform gridsearch and do k-fold cross validation with stratified sampling for each set of parameters. I am aware that the proper way is to perform the oversampling on the training set of each fold (as imbalanced-learn's pipeline class does).
Due to the large computation time for oversampling, however, I would like to move that step outside the CV, so that I need to resample only once for all folds. Is there a sensible way to do that, preferably with tools provided by sklearn or other compatible Python libraries?
As for theoretical soundness, I thought it would be reasonable to do the following:
- Resample the entire dataset, but keep synthetic examples separate.
- During CV, for each fold, do stratified sampling on the non-resampled data.
- Add a likewise stratified subset of the synthetic data to the training set of the fold.
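Concretely, the scheme above would look something like this sketch (function name and data layout are hypothetical; with SMOTENC the synthetic rows are all minority class, so the stratification of the synthetic subset is trivial):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def cv_with_precomputed_synthetic(X_real, y_real, X_syn, y_syn,
                                  n_splits=5, seed=0):
    """For each fold: stratified split of the real data, plus a stratified
    fraction of the pre-computed synthetic examples added to the train split."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    frac = (n_splits - 1) / n_splits  # match the train split's share of the data
    for train_idx, test_idx in skf.split(X_real, y_real):
        # Sample a stratified subset of the synthetic data for this fold's
        # train set (trivial stratification if y_syn is a single class).
        X_syn_tr, _, y_syn_tr, _ = train_test_split(
            X_syn, y_syn, train_size=frac, stratify=y_syn, random_state=seed)
        X_train = np.vstack([X_real[train_idx], X_syn_tr])
        y_train = np.concatenate([y_real[train_idx], y_syn_tr])
        yield X_train, y_train, X_real[test_idx], y_real[test_idx]
```

The test split contains only real examples, but the synthetic rows in the train split were generated from the full dataset, which is where my leakage worry below comes from.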
But now I see that this leads to data leakage between the sets, since some information from each fold's test set is bound to be contained in the synthetic examples placed in its training set. And since synthetic examples cannot be traced back to the real examples they were generated from, this can probably not be remedied.
Thanks!