After applying SMOTE, the class distribution doesn't match the real world. Is this a problem?

Question

I have an extremely unbalanced dataset with two classes:

1: 1,800 # class 1 
0: 40,000 # class 0

This is real-world customer data on churned vs. not-churned customers.

If I were to use SMOTE to oversample (or undersample), the model would be trained on an artificially created class distribution that does not reflect the real world at all.

Should the training data reflect the real world, or is SMOTE a widely used technique for balancing classes in production environments? I feel like I would be training a model for a completely different task if the training data doesn't reflect the unseen data.
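To make the mismatch concrete, here is a minimal sketch that computes the real churn rate from the counts above (about 4.3%) and applies a standard prior-shift correction to probabilities from a model trained on rebalanced data. The 50/50 target ratio and the helper name `correct_for_prior_shift` are assumptions for illustration. The correction also assumes that resampling changes only the class proportions, which is only approximately true for SMOTE's synthetic points.

```python
# Class counts from the question.
n_pos, n_neg = 1_800, 40_000

# Real-world share of the positive (churned) class.
real_prior = n_pos / (n_pos + n_neg)

# Assumption: resampling is run until both classes are the same size.
resampled_prior = 0.5


def correct_for_prior_shift(p_resampled, train_prior, true_prior):
    """Map a probability learned under train_prior back to true_prior.

    Hypothetical helper. Assumes resampling changed only the class
    proportions, not each class's feature distribution.
    """
    pos = p_resampled * true_prior / train_prior
    neg = (1.0 - p_resampled) * (1.0 - true_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


print(f"real prior: {real_prior:.4f}")
for p in (0.5, 0.8, 0.95):
    adjusted = correct_for_prior_shift(p, resampled_prior, real_prior)
    print(f"resampled p={p:.2f} -> adjusted p={adjusted:.4f}")
```

A score of 0.5 from the balanced model maps back to the real churn rate, which shows how much the rebalanced training set inflates the predicted probabilities if they are read at face value.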

My model is currently performing quite poorly on class 1, which is why I'm investigating SMOTE.

Any insight from people who have used SMOTE is appreciated. I can't find any discussion of this issue in SMOTE tutorials online, so perhaps it's not an issue?

It is bad statistical practice that requires excluding data in order to make the method work. This is a symptom of using the [wrong accuracy scoring rule](http://fharrell.com/post/class-damage), and often of [misrecognizing the problem as classification instead of prediction](http://fharrell.com/post/classification). — Frank Harrell, Nov 27 '19 at 13:26
I am monitoring precision, recall, F-score and area under the precision-recall curve. I am also predicting probabilities and converting them to labels using different thresholds. The results are still not great even with threshold adjustment, which is why I am now investigating SMOTE. I read your link. Are you suggesting I evaluate the model with the suggested metrics such as the "Brier score" rather than what I've already mentioned? In any case, I am looking for info on SMOTE. You mention it's bad statistical practice, so I guess it shouldn't be used at all? — SCool, Nov 27 '19 at 14:24
@kjetilbhalvorsen I can't find anything about SMOTE in that link. — SCool, Nov 27 '19 at 16:56
See https://stats.stackexchange.com/questions/97555/handling-unbalanced-data-using-smote-no-big-difference and search this site! That link is mostly an explanation of the comments here by @Frank Harrell. Take those seriously! — kjetil b halvorsen, Nov 27 '19 at 17:22
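
For readers unfamiliar with the metrics discussed in these comments, the following is a minimal sketch of the Brier score next to threshold-based precision and recall. The ten labels and probabilities are hypothetical toy values, not the poster's data, and the helper names are illustrative only.

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def precision_recall(y_true, p_pred, threshold):
    """Precision and recall for class 1 when labelling p >= threshold as 1."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 2 churners among 10 customers.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
p_model = [0.05, 0.10, 0.05, 0.20, 0.10, 0.30, 0.05, 0.15, 0.60, 0.35]

# Baseline that always predicts the observed churn rate.
base_rate = sum(y_true) / len(y_true)
p_baseline = [base_rate] * len(y_true)

print("Brier (model):   ", round(brier_score(y_true, p_model), 4))
print("Brier (baseline):", round(brier_score(y_true, p_baseline), 4))

# Lowering the threshold trades precision for recall on the rare class.
for t in (0.5, 0.3, 0.2):
    prec, rec = precision_recall(y_true, p_model, t)
    print(f"threshold={t:.1f} precision={prec:.2f} recall={rec:.2f}")
```

The Brier score is computed on the probabilities themselves and needs no threshold, whereas precision and recall change with every threshold choice. Comparing against the constant base-rate baseline gives a rough sense of whether the model's probabilities carry any information beyond the class proportion.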
