
I have a very unbalanced dataset (99.8% negative, 0.2% positive) with approximately 60 variables. I removed around 40 of them based on the variance inflation factor, then used SMOTE to oversample the minority class.
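For reference, the VIF screening step above can be sketched with plain numpy (this is an illustrative reconstruction, not the asker's actual code; `statsmodels.stats.outliers_influence.variance_inflation_factor` computes the same quantity). Each feature's VIF is the corresponding diagonal element of the inverse of the feature correlation matrix:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features).

    VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing column i on
    the remaining columns; equivalently, the i-th diagonal entry of the
    inverse correlation matrix.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic demo: column 2 is nearly a copy of column 0, column 1 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=1000)])

print(vif(X))  # columns 0 and 2 get large VIFs; column 1 stays near 1
```

A common screening rule is to iteratively drop the feature with the highest VIF above some threshold (often 5 or 10) and recompute, rather than dropping all flagged features at once.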


I am now using XGBoost to build the model. I have tried class weighting and regularization, and tuned the other hyperparameters with RandomizedSearchCV. However, my model still massively overfits: the F1 score is approximately 0.5, with precision and recall both dramatically reduced out of sample. How do I reduce overfitting in this scenario?
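As an aside on the class-weighting step: XGBoost's `scale_pos_weight` parameter (typically set to `n_negative / n_positive`) reweights positive examples in the loss, which addresses the imbalance without the synthetic samples SMOTE creates. A minimal numpy sketch of what that weighting means, using the class counts from the comments below (5934 samples, 111 positive) — the counts are from this page, the rest is illustrative:

```python
import numpy as np

# Class counts reported in the comments: 5934 total, 111 positive.
y = np.array([0] * 5823 + [1] * 111)
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print(round(scale_pos_weight, 1))  # ~52.5

def weighted_logloss(y_true, p, w_pos):
    """Log-loss in which each positive example counts w_pos times.

    This is the effect of scale_pos_weight: positives contribute w_pos
    times as much to the loss (and hence to the gradients) as negatives.
    """
    w = np.where(y_true == 1, w_pos, 1.0)
    per_example = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.average(per_example, weights=w)
```

With `w_pos = 1` this reduces to the ordinary log-loss; raising it pushes the booster to pay more attention to the rare class. Combining this with early stopping on a held-out (un-oversampled) validation set is a standard way to curb the kind of train-set-perfect overfitting described here.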
kjetil b halvorsen
  • [Our Stephan Kolassa argues that class imbalance presents no problem at all.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) // Are you sure you have overfitting and not just poor performance? How does your training performance compare to your out-of-sample performance? – Dave Jul 19 '21 at 21:48
  • How many total positive cases do you have? – EdM Jul 19 '21 at 21:56
  • 5934 samples, 111 are positive. Precision, recall, ROC AUC, and F1 score are all 1.0 for the training data. This is an overfitting issue – sinha-shaurya Jul 20 '21 at 05:31

0 Answers