
What are the best ways to deal with imbalanced datasets when classifying whether or not individuals pay their tuition? The data is 75% positive class (paid) and 25% negative (unpaid). Some approaches I have read about include stratified k-folds, undersampling and oversampling, and synthetic data generation with approaches like SMOTE. One challenge I am currently facing is that my XGBoost classifier predicts almost all positives because the class imbalance leans towards the positive class.
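
Roughly, the setup I have in mind looks like the sketch below (this is only an outline, not my exact code; it assumes scikit-learn, imbalanced-learn, and xgboost are available, and `X`/`y` are NumPy arrays standing in for my features and the paid/unpaid labels):

```python
# Sketch: stratified k-fold CV with SMOTE applied only to each training fold,
# so every validation fold keeps the original ~75/25 paid/unpaid ratio.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

def cross_validate(X, y, n_splits=5, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, val_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_val, y_val = X[val_idx], y[val_idx]

        # Oversample the minority (unpaid) class in the training fold only;
        # the validation fold is left untouched.
        X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)

        model = XGBClassifier()
        model.fit(X_tr, y_tr)

        # Score predicted probabilities rather than hard 0/1 labels.
        proba = model.predict_proba(X_val)[:, 1]
        aucs.append(roc_auc_score(y_val, proba))
    return float(np.mean(aucs))
```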

Instead of tackling the imbalance by modifying the data, can certain classification algorithms handle imbalanced data better than others?

Lastly, when is data considered imbalanced from a practical standpoint (60-40, 80-20, 95-5, etc.)? Essentially, I am asking whether mild cases of imbalance are still worth addressing, or only severe ones.

Jane Sully
    There are many pages on this site dealing with these questions. This [list of top-voted questions and answers](https://stats.stackexchange.com/questions/tagged/unbalanced-classes?sort=votes&pageSize=50) is a good place to start. Please look some of those over first, and then edit this question to focus on what you still find confusing. It might be best if you could describe a practical situation that you face, as more specific questions can often get more useful answers. – EdM Oct 14 '18 at 17:39
  • In what sense do you believe that imbalanced data is a problem? What problem are you trying to solve? – Sycorax Oct 14 '18 at 23:45
  • It's not that I believe it's a problem, because I think it is representative of the larger data at hand (I am predicting whether individuals pay their fees or not). However, my model is not very predictive, and that's where I think imbalanced data is a problem. The data I have is about 75% positive class (paid) and 25% negative (unpaid), but depending on the fold my model predicts all or nearly all the positive class. – Jane Sully Oct 15 '18 at 00:01
  • The statement "depending on the fold my model predicts all or nearly all the positive class" makes it sound like you're using accuracy or similar to assess your model. Accuracy is not a good metric to use to evaluate classifier performance. See https://stats.stackexchange.com/questions/312780/why-is-accuracy-not-the-best-measure-for-assessing-classification-models for more details. – Sycorax Oct 15 '18 at 00:34
  • Nope, I am using precision, recall, and AUC. That comment was based on the confusion matrix for each fold in my k-fold cross-validation. – Jane Sully Oct 15 '18 at 00:47
  • A confusion matrix is not sensitive to the degree of correctness of the model's scores, in the sense that it is based on a dichotomous threshold. Class imbalances are only a "problem" in the sense that confusion matrices, like accuracy, are misleading when class imbalance is present. This is a reason to avoid confusion matrices, not a reason to mutilate your data. – Sycorax Oct 15 '18 at 04:21
  • Okay, you bring up a good point. So you would just leave the data as is? My problem is that the model holds little predictive power because it guesses positive for nearly every example, and I would ideally want it to do better than something that does not differentiate between examples. Also, although I cannot share the data itself, I would be happy to share any anonymized info about the data and performance. – Jane Sully Oct 15 '18 at 12:21
  • See the discussion on [this page](https://stats.stackexchange.com/q/90659/28500) for why precision and recall aren't good metrics. You need a model that gives class probabilities rather than making some arbitrary choice about parameter settings or cutoff values to fill a confusion matrix. Then rate models by their [Brier Scores](https://en.wikipedia.org/wiki/Brier_score), the mean-square difference between predicted probabilities and actual class membership coded as {0,1}. You then choose a probability cutoff for classification that takes into account your tradeoffs for false classifications. – EdM Oct 15 '18 at 15:18
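
Following up on the Brier score suggestion in the last comment, here is a minimal self-contained sketch of scoring predicted probabilities directly (the numbers are made up purely for illustration; scikit-learn's `brier_score_loss` is assumed to be available):

```python
# Sketch: rate a probability model by its Brier score, the mean squared
# difference between predicted probabilities and actual {0, 1} outcomes.
import numpy as np
from sklearn.metrics import brier_score_loss

y_val = np.array([1, 1, 1, 0])           # actual outcomes (1 = paid, 0 = unpaid)
proba = np.array([0.9, 0.8, 0.6, 0.3])   # predicted P(paid); illustrative values only

brier = brier_score_loss(y_val, proba)         # library implementation
brier_by_hand = np.mean((proba - y_val) ** 2)  # same quantity computed directly

print(brier, brier_by_hand)  # 0.075 0.075

# Any classification cutoff is then chosen afterwards, from the relative costs
# of false positives vs. false negatives, rather than being fixed at 0.5.
```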

0 Answers