
I have an imbalanced dataset (10000 positives and 300 negatives) and have divided this into train and test sets. I perform oversampling/undersampling only on the train set since doing this on the test set would not represent a real-world scenario.
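
Roughly, the pipeline I have in mind looks like the sketch below (scikit-learn plus imbalanced-learn, with synthetic data standing in for my dataset; `RandomOverSampler` is just one possible choice of resampler):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Synthetic stand-in for the real data: roughly 10,000 positives (class 1,
# the majority here, as in the question) and 300 negatives (class 0).
X, y = make_classification(
    n_samples=10_300, weights=[0.03, 0.97], flip_y=0, random_state=0
)

# Split first, so the test set keeps the real-world class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Resample only the training set.
X_train_res, y_train_res = RandomOverSampler(random_state=0).fit_resample(
    X_train, y_train
)

clf = RandomForestClassifier(random_state=0).fit(X_train_res, y_train_res)
```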

A Random Forest Classifier is able to classify the training set well (F-score of 0.92 for both positive and negative class) but performs badly on the test set (F-score of 0.83 for the positive class and 0.13 for the negative class).

Why does the classifier perform poorly on the test set although it has learnt to identify the difference between the two classes in the train set? Could it be because the distribution of the train set is now different from the test set? If so, how do I take care of this?

I came across this post but the answers are not particularly helpful.

Anuj
  • Over/undersampling doesn't add any new information; it only replicates data. It is done to prevent the model from being biased, but it still doesn't help the model learn better. – user2974951 Sep 25 '19 at 13:37
  • Yes, this was my understanding too. But why is the model performing so poorly on the test set even though it has learnt features from the train set? Also, if I perform oversampling/undersampling on the test set as well, I get good results on both the positive and negative classes. This led me to believe that sampling changes the distribution of the data. – Anuj Sep 25 '19 at 13:46
  • https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – mkt Sep 25 '19 at 13:55
  • Why do you think 0.83 is poor? It's normal for performance on the test set to be worse than on the training set – mkt Sep 25 '19 at 13:55
  • @mkt It is 0.83 for the positive class which is the majority but only 0.13 for the negative (minority) class. It classifies data from the positive class as the negative class. – Anuj Sep 25 '19 at 14:17
  • My mistake. See https://stats.stackexchange.com/a/210718/121522 and https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models – mkt Sep 25 '19 at 14:21
  • @mkt thank you for the links but I don't see how they answer my question. The first link has a post about how f-1 score is not an ideal performance metric for imbalanced classification. But I have read numerous papers, like [this](https://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf) which prove that f-1 score is a good performance metric. The second link has good points about why accuracy is not the best measure, which I am well aware of, and that is exactly why I am not using accuracy to measure the performance of the model. – Anuj Sep 25 '19 at 14:40
  • I posted the links because improper scoring rules are a common topic of discussion here. The second link focuses on accuracy but it illustrates some of the problems with improper scoring rules in general. It's not my area, so I'll leave it at that. – mkt Sep 25 '19 at 14:48
  • @mkt I appreciate the help! Thanks a lot! – Anuj Sep 25 '19 at 14:53

3 Answers


The answer to the title question is "of course it does"; you are shifting the distribution toward the minority class.

You can shift your model's predictions back to match the original distribution; see e.g. Convert predicted probabilities after downsampling to actual probabilities in classification. Equivalently, you can adjust the prediction threshold.
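
As a rough sketch of what that shift can look like (assuming, as in the calibration paper linked in the comments below, that the majority class was undersampled with a known fraction β of it kept, and that the scores are probabilities for the minority class; the numbers here are only illustrative):

```python
import numpy as np

def correct_undersampled_proba(p_s, beta):
    """Map minority-class probabilities from a model trained on data whose
    majority class was undersampled (keeping a fraction `beta` of it) back to
    probabilities under the original class ratio."""
    return beta * p_s / (beta * p_s - p_s + 1)

# Illustrative scores from the model trained on the resampled data.
p_s = np.array([0.2, 0.5, 0.8])
beta = 300 / 10_000  # e.g. majority downsampled to match the 300 minority cases

# Corrected probabilities are pulled back down toward the true minority prevalence.
print(correct_undersampled_proba(p_s, beta))

# Equivalently, leave the scores alone and move the decision threshold:
# the corrected probability crosses 0.5 exactly when the raw score crosses
threshold = 1 / (1 + beta)   # ≈ 0.97 here, instead of the default 0.5
print(threshold)
```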

There's also a serious question of whether you needed to resample in the first place; see What is the root cause of the class imbalance problem? and When is unbalanced data really a problem in Machine Learning? If you do get better performance after balancing, with correct use of prediction thresholds/shifting, I'd like to know about it. I haven't been able to find a definitive answer on whether balancing helps a classifier learn. (Henry's answer to the second linked question here suggests not, but...)

Ben Reiniger
  • Thanks for the reply! The links were very useful. Would the model perform better if the model's predictions are changed to match the original distribution i.e. would the formula mentioned in https://www3.nd.edu/~rjohns15/content/papers/ssci2015_calibrating.pdf actually improve the number of correct classifications? The authors of that paper have mentioned that the formula does not affect the ranking. – Anuj Sep 25 '19 at 18:22
  • The shift is a linear shift in log-odds, so it is monotonic on probabilities and doesn't affect rank-ordering (and hence doesn't affect AUROC/AUPRC). It will affect the F1-score (and all confusion-matrix-based scores), but the same effect can be achieved by adjusting the threshold for class predictions. – Ben Reiniger Sep 25 '19 at 18:40

The F1 score is the harmonic mean of precision and recall. Precision is defined as $$\text{precision} = P(Y = 1 | \hat Y = 1)$$ while recall is defined as $$\text{recall} = P(\hat Y = 1 | Y = 1).$$ Recall does not care about the distribution of the true class labels, since the true label is conditioned upon in its definition. Say you trained your model and the recall is 90%; putting more positive samples in the test set will not change that. This is not true for precision: you could in principle throw tons of negative samples into your test set, and the precision would be pushed towards zero. In short, your model might be doing OK. Comparing class-distribution-sensitive metrics to detect over- or underfitting when you up-/down-sample the training set will give you misleading results.
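
To make that concrete, here is a toy calculation; the numbers are invented, not taken from the question:

```python
# A classifier with fixed recall and false-positive rate, evaluated on test
# sets containing more and more negatives: recall stays put, precision drops.
recall = 0.90      # P(Yhat = 1 | Y = 1): unaffected by how many negatives there are
fpr = 0.10         # P(Yhat = 1 | Y = 0)
positives = 1_000

for negatives in (1_000, 10_000, 100_000):
    tp = recall * positives
    fp = fpr * negatives
    precision = tp / (tp + fp)
    print(f"{negatives:>7} negatives -> precision = {precision:.3f}, recall = {recall:.2f}")
```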

To detect over- or underfitting in your case, I would suggest looking at the ROC curve first. ROC is not sensitive to the class distribution. If the ROC AUC differs a lot between the training and test sets, then you are likely over- or underfitting. Then you can look at things like average precision or other metrics on the test set.
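
A minimal sketch of that check, assuming a fitted scikit-learn classifier `clf` and the train/test split from the question, as in the sketch there (the variable names are mine):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

proba_train = clf.predict_proba(X_train)[:, 1]
proba_test = clf.predict_proba(X_test)[:, 1]

# ROC AUC is insensitive to the class distribution; a large train/test gap
# suggests over- or underfitting.
print("train ROC AUC:", roc_auc_score(y_train, proba_train))
print("test  ROC AUC:", roc_auc_score(y_test, proba_test))

# Only then look at distribution-sensitive metrics such as average precision.
print("test average precision:", average_precision_score(y_test, proba_test))
```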

etudiant

I'd suggest avoiding oversampling and undersampling approaches. For a good classifier and performance metric it should not matter, and it is unscientific. The F1 score is biased towards the majority class and is sensitive to the class distribution; see https://arxiv.org/ftp/arxiv/papers/1503/1503.06410.pdf. Also consider the ROC curve and ROC-AUC as an objective way of assessing performance that is suited to imbalanced data. In general it is better to look at a curve than to rely on a single default cut-off such as p = 0.5, as the F1 score does. I prefer the precision-recall-gain curve to the precision-recall curve, because it is standardised against the baseline. For PR curves it matters a lot which class is defined as the positive one, which is a deficiency of the approach.
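
As a small illustration of the cut-off issue (assuming, as in the sketches above, a fitted classifier `clf` and a held-out test set `X_test`, `y_test`; these names are mine):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

scores = clf.predict_proba(X_test)[:, 1]

# ROC AUC is computed from the scores directly, with no cut-off involved.
print("ROC AUC:", roc_auc_score(y_test, scores))

# F1, by contrast, depends entirely on where the cut-off is placed.
for threshold in np.linspace(0.1, 0.9, 9):
    preds = (scores >= threshold).astype(int)
    print(f"cut-off {threshold:.1f}: F1 = {f1_score(y_test, preds, zero_division=0):.3f}")
```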