I will answer a slightly rephrased version of this question: "What is the best sampling scheme for imbalanced data?"
In my answer below I link to a few different articles and explain how to deal with imbalanced data to build a stable classifier, whether that is in a live/production environment or in a static model. Live/production models are sensitive to shifts in the data, so it is hard to know up front exactly how they will behave, but a good sampling scheme and loss function will go a long way toward addressing class imbalance.
The classic imbalanced-sampling technique is SMOTE (reference below), which
oversamples the minority class by generating synthetic examples to increase
its prevalence. Boosting algorithms such as AdaBoost do something similar:
each round up-weights the cases the previous round got wrong in order to fix
its prediction errors. Focal loss is similar in spirit in that it down-weights
the "easy" examples in the loss function, so it makes sense to use it here.
The tricky part is that boosting algorithms are prone to overfitting, since
their reweighting chases the remaining errors, so you must always be careful
about how you combine sampling schemes and loss functions. That is the only
caveat with them. All three references are listed below, followed by a short
SMOTE sketch.
SMOTE: Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002).
SMOTE: Synthetic minority over-sampling technique. Journal of Artificial
Intelligence Research, 16, 321-357.
AdaBoost: Rätsch, G., Onoda, T., & Müller, K.-R. (2001). Soft margins for
AdaBoost. Machine Learning, 42(3), 287-320.
Focal: Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal
loss for dense object detection. In Proceedings of the IEEE International
Conference on Computer Vision (pp. 2980-2988).
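Here is a minimal sketch of SMOTE-style oversampling, assuming the imbalanced-learn (imblearn) package is available; the synthetic 90/10 dataset, the logistic-regression model, and the hyperparameters are illustrative placeholders only:

```python
# Minimal SMOTE sketch (assumes imbalanced-learn is installed).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced data stands in for a real dataset.
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# SMOTE interpolates new minority-class points between existing neighbours;
# it is applied to the training split only, never to the test data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print("test accuracy:", clf.score(X_test, y_test))
```

Note that only the training split is resampled; the test split keeps the original class ratio, so the evaluation reflects the distribution the model would actually see in production.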
Does it make sense to use focal loss for a tree-based classifier like XGBoost?
The one big issue to consider (which was addressed in the comment above) is that, for optimal performance, you also want to train the model on data similar to what it will see in production. You will hurt it by forcing a 50/50 class balance if the live data will really be 20/80. That is where approaches that work through the loss function, like focal loss, or AdaBoost, which up-weights the "wrong" cases in each subsequent boosting round to correct the errors, come in; simply oversampling the minority class might hurt. A sketch of a focal-loss objective for XGBoost follows.
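Here is a hedged sketch of one way to plug a focal loss into XGBoost via its custom-objective hook; the gamma/alpha values, the synthetic dataset, and the finite-difference derivatives are assumptions made for brevity, not a reference implementation:

```python
# Sketch: focal loss as a custom XGBoost objective.
# XGBoost's training API accepts an objective that returns per-sample
# gradients and Hessians; here both are approximated by central finite
# differences, which is slow but keeps the example short and easy to check.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

GAMMA, ALPHA = 2.0, 0.25  # common focal-loss settings, assumed here


def focal_loss(z, y):
    """Element-wise focal loss on raw margins z with labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-z))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(ALPHA * y * (1 - p) ** GAMMA * np.log(p)
             + (1 - ALPHA) * (1 - y) * p ** GAMMA * np.log(1 - p))


def focal_objective(preds, dtrain):
    """Custom objective: gradient and Hessian of the focal loss w.r.t. preds."""
    y = dtrain.get_label()
    eps = 1e-4
    f_plus = focal_loss(preds + eps, y)
    f_minus = focal_loss(preds - eps, y)
    f_zero = focal_loss(preds, y)
    grad = (f_plus - f_minus) / (2 * eps)
    # Clamp the Hessian away from zero so splits stay numerically stable.
    hess = np.maximum((f_plus - 2 * f_zero + f_minus) / eps ** 2, 1e-6)
    return grad, hess


# Illustrative 90/10 imbalanced data, as in the SMOTE sketch above.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain,
                    num_boost_round=50, obj=focal_objective)
```

The finite differences are only there to keep the sketch easy to verify; in practice you would derive the gradient and Hessian analytically for speed, but the overall shape of the custom objective stays the same.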