Will oversampling help with generalization (small imbalanced dataset)?

Question

I have an imbalanced dataset (2:1 ratio) with about 60 patients and 80 features. I ran RFE with stratified cross-validation to reduce the features to 15, and I get an AUC of 0.9 with logistic regression and/or an SVM. I don't fully trust this AUC: with such a small positive class, I doubt it will generalize. So I was thinking of oversampling the minority class (K-means + PCA) and re-running the RFE approach. Would this help? Thanks.

My question is more or less the same as this one: https://datascience.stackexchange.com/questions/28227/why-will-the-accuracy-of-a-highly-unbalanced-dataset-reduce-after-oversampling except that I use AUC rather than accuracy.
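For reference, here is a minimal sketch of how such a comparison could be set up so that oversampled copies never end up in a test fold. It makes several assumptions that are not in the question: the data is a synthetic stand-in from `make_classification`, the helper `oversample_minority` is hypothetical, and it uses plain random oversampling (duplicating minority rows) rather than the K-means + PCA approach mentioned above. The RFE step is left out to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic stand-in (assumption): 60 samples, 80 features, roughly 2:1 classes.
X, y = make_classification(n_samples=60, n_features=80, n_informative=10,
                           weights=[2 / 3, 1 / 3], random_state=0)


def oversample_minority(X, y, random_state=0):
    """Hypothetical helper: duplicate random minority rows until classes balance."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = int(counts.max() - counts.min())
    if n_extra == 0:
        return X, y
    X_extra = resample(X[y == minority], replace=True,
                       n_samples=n_extra, random_state=random_state)
    y_extra = np.full(n_extra, minority)
    return np.vstack([X, X_extra]), np.concatenate([y, y_extra])


cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_aucs = []
for train_idx, test_idx in cv.split(X, y):
    # Oversample the training fold only; the test fold keeps its original rows.
    X_train, y_train = oversample_minority(X[train_idx], y[train_idx])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.decision_function(X[test_idx])
    fold_aucs.append(roc_auc_score(y[test_idx], scores))

print("per-fold AUC-ROC:", np.round(fold_aucs, 3))
```

Running the same loop with and without the `oversample_minority` call gives a like-for-like comparison. With only about 60 samples the per-fold scores will vary a lot, so small differences should not be over-interpreted.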

Which area under the curve (AUC) exactly are you reporting? If you mean the AUC of the ROC curve, your intuition is correct that the estimate may be too optimistic. AUC-ROC can look overly optimistic when classes are imbalanced. Consider the AUC of the precision-recall (PR) curve instead, which reflects performance on the minority class more directly. See this CV post for more on classification metrics: https://stats.stackexchange.com/questions/7207/roc-vs-precision-and-recall-curves - Tomas Bencomo, Jan 17 '20 at 23:43
I hope you are familiar with scikit-learn. I pass the output of classifier.fit().decision_function() together with y_true to metrics.roc_curve() to get (fpr, tpr) pairs, and then compute the AUC from those with metrics.auc(). So I think I am calculating AUC-ROC. Please correct me if I'm wrong. - Luis Pinto, Jan 18 '20 at 06:56
Yes, that should be AUC-ROC. See `average_precision_score` in sklearn to compute a precision-recall score. I'd check performance with this metric first and see how the model does before moving forward with oversampling approaches. - Tomas Bencomo, Jan 19 '20 at 20:26
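The two metrics discussed in these comments can be computed side by side from the same decision scores. The sketch below is only an illustration under stated assumptions: the data is synthetic (`make_classification`), and the scaling step, `step=5`, and placing RFE inside a pipeline so that feature selection is refit within each fold are choices made here, not details confirmed by the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, roc_auc_score, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 60 samples, 80 features, roughly 2:1 classes.
X, y = make_classification(n_samples=60, n_features=80, n_informative=10,
                           weights=[2 / 3, 1 / 3], random_state=0)

# RFE lives inside the pipeline, so it only ever sees the training part of each fold.
model = make_pipeline(
    StandardScaler(),
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=15, step=5),
    LogisticRegression(max_iter=1000),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Out-of-fold decision scores: each sample is scored by a model that never saw it.
scores = cross_val_predict(model, X, y, cv=cv, method="decision_function")

# AUC-ROC the way it is described above: roc_curve() followed by auc().
fpr, tpr, _ = roc_curve(y, scores)
roc_auc = auc(fpr, tpr)

# Average precision summarizes the precision-recall curve.
pr_auc = average_precision_score(y, scores)

# A random classifier's average precision is about the positive-class prevalence.
baseline = y.mean()

print(f"AUC-ROC: {roc_auc:.3f}")
print(f"Average precision: {pr_auc:.3f} (chance level is about {baseline:.3f})")
```

Note that this pools out-of-fold scores across folds before computing each metric. Computing the metric per fold and averaging is an equally common choice and may give slightly different numbers.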
