I have an extremely unbalanced dataset with two classes:
1: 1,800 # class 1
0: 40,000 # class 0
This is real-world customer data labelled as churned / not churned.
If I were to use SMOTE to oversample / undersample, the model would be trained on an artificially created distribution of classes that does not reflect the real world whatsoever.
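To make the concern concrete, here is a rough sketch of how the class proportions would shift. It only does the arithmetic on the counts above and does not run any resampling library. It assumes the minority class is oversampled all the way up to the majority count, which is my assumption about a fully balanced resample and not something I have verified for a particular implementation.

```python
# Class counts from the dataset described above.
counts = {1: 1_800, 0: 40_000}


def class_shares(counts):
    """Return each class's fraction of the total number of rows."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Assumption: balancing raises every class to the majority-class count.
balanced = {label: max(counts.values()) for label in counts}

original_shares = class_shares(counts)
balanced_shares = class_shares(balanced)

print(f"original class 1 share: {original_shares[1]:.1%}")
print(f"balanced class 1 share: {balanced_shares[1]:.1%}")
```

Under that assumption, class 1 goes from roughly 4.3% of the rows to 50%, which is the gap between training data and unseen data that I'm worried about.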
Should the training data reflect the real world, or is SMOTE a widely used technique to balance the classes in production environments? I feel like I would be training a model for a completely different task if the training data doesn't reflect the unseen data.
My model is currently performing quite poorly on class 1, which is why I'm investigating SMOTE.
Any insight from people who have used SMOTE is appreciated. I can't find any discussion of this issue in SMOTE tutorials online, so perhaps it's not an issue?