What is the point of oversampling an imbalanced data set if the ratio of the classes needs to be preserved in cross-validation? If I have 1000 rows in a data set where 800 rows belong to one class and 200 rows to another class, this is considered an imbalanced data set due to its skewed distribution. If I oversample the smaller class by simply duplicating its rows to get 400 rows, the ratio of the majority and minority class changes. Having done that, why would I ensure in my cross-validation folds (if I should) that the ratio of majority to minority class is still 80:20? What am I missing here? Additionally, my test set might not have the same ratio of classes. How can I ensure that the skewed distribution does not impact my model?
- [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jul 02 '20 at 08:55
- Thank you, @StephanKolassa, I browsed the answer. It doesn't say anything about cross-validation and maintaining the majority:minority class ratio in the context of oversampling. – learner Jul 02 '20 at 09:09
- @learner: Well, the answers explain that oversampling is usually not what you really want, i.e. that oversampling is beside the point. Which answers your question "what is the point of oversampling" with "none". As for the cross-validation, you could of course oversample in each of your CV training sets. That would give you a valid CV with oversampling. – cbeleites unhappy with SX Jul 02 '20 at 12:03
- @cbeleites unhappy with SX: Just to clarify... oversampling the data set is of no use, but oversampling each fold in k-fold cross-validation would help with the imbalance? I am distorting the original 80:20 distribution of the classes to 67:33 in each fold; is that OK? I am very confused. – learner Jul 02 '20 at 12:26
- @learner: Oversampling is frequently useless, and may even be harmful. The only exception I can think of is oversampling to get closer to the known distribution of cases in your application (as opposed to getting closer to a balanced data set). But if you insist on doing oversampling, you should do it inside the CV. Or, if you do it outside, you'll have to make sure that all copies of each case always end up on the same side of the train/test splits; otherwise you would not get independent splits. – cbeleites unhappy with SX Jul 02 '20 at 12:34
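The splitting pitfall in the last comment can be made concrete. Below is a minimal sketch (my own illustration, not from the comment author), assuming Python with scikit-learn and the 1000-row toy labels from the question: if the minority rows are duplicated before splitting, a plain `KFold` can place copies of the same original row in both the training and validation parts of a split, whereas grouping on the original row id keeps every copy on the same side.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

y = np.r_[np.zeros(800), np.ones(200)]   # original 80:20 labels, as in the question
row_id = np.arange(len(y))               # identity of each original row

# Oversample *before* splitting: duplicate every minority row once (800:400)
minority = np.where(y == 1)[0]
y_os = np.concatenate([y, y[minority]])
row_id_os = np.concatenate([row_id, row_id[minority]])

# Naive KFold on the oversampled data: copies of a row can leak across the split
leaked = 0
for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(y_os):
    leaked += int(np.isin(row_id_os[va], row_id_os[tr]).sum())
print("validation rows whose copy sits in training (KFold):", leaked)

# Splitting by original row id keeps all copies of a case on the same side
leaked = 0
for tr, va in GroupKFold(n_splits=5).split(y_os, groups=row_id_os):
    leaked += int(np.isin(row_id_os[va], row_id_os[tr]).sum())
print("validation rows whose copy sits in training (GroupKFold):", leaked)
```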
1 Answer
Keeping the class balance in cross-validation (stratification) means that every fold has approximately the same class distribution as the training set, so that the fold results are comparable among themselves and with the results on the full training set; otherwise, when averaging over the folds, some folds might be biased due to particularly unlucky sampling.
However, if your dataset is big enough this is usually not a problem worth focusing on.
If you instead decide to oversample your minority class for some reason (different misclassification costs, for example), then this oversampled ratio (in your example, your minority class would go from 20% to 33%) is the one you would want to have in your cross-validation folds. This means that every fold should have around the same class ratios as your (new, oversampled) training set.
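Here is a minimal sketch (my own illustration, assuming Python with scikit-learn and a synthetic toy data set) of the "oversample inside the CV" option mentioned in the comments: folds are stratified on the original 80:20 distribution, the minority class is duplicated by resampling only in each fold's training portion, and each model is evaluated on the untouched validation portion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # synthetic toy features
y = np.r_[np.zeros(800), np.ones(200)]    # 80:20 imbalance as in the question
X[y == 1] += 0.5                          # give the minority class a weak signal

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Oversample the minority class by duplication, on the training portion only
    minority = np.where(y_tr == 1)[0]
    extra = rng.choice(minority, size=len(minority), replace=True)  # roughly doubles it
    X_tr = np.vstack([X_tr, X_tr[extra]])
    y_tr = np.concatenate([y_tr, y_tr[extra]])

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # The validation fold keeps the original 80:20 distribution
    scores.append(balanced_accuracy_score(y[val_idx], model.predict(X[val_idx])))

print("mean CV balanced accuracy:", np.mean(scores))
```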
