A question about a logistic regression classifier performance (with and without resampling)

Question

I am working on a dataset with 20 independent variables and 41188 instances. The task is binary classification, and the target variable has 36548 no's and 4640 yes's. I used a logistic regression model with 10-fold cross-validation. Since the target variable is unbalanced, I decided to resample the data. I built the model three times: first without resampling, then with under-sampling, and finally with SMOTE. The resulting reports are as follows:

- Without resampling:

- With under-sampling technique:

- With SMOTE technique:

Usually with unbalanced data, accuracy alone is not sufficient to evaluate a classifier, so precision, recall and ROC values should be taken into account as well. The first model, built without any resampling, delivers higher weighted-average accuracy, precision and recall than the others, while its ROC value is slightly lower than that of the model with SMOTE resampling. Moreover, in the first model the recall of class yes (0.423) is much lower than the recall of class no (0.973).

My question is: which model is more trustworthy, and why did the accuracy, precision and recall values decrease after resampling the data?

Our Frank Harrell has a nice post about SMOTE: https://twitter.com/f2harrell/status/1062424969366462473?lang=en. — Dave, Dec 28 '20 at 19:06

score 0 · Answer 1 · answered Dec 31 '20 at 03:10

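The absolute values in the confusion matrices suggest that you've resampled the data before doing the cross-validation splits. That's generally a big no-no: the test folds then no longer have the original class balance (and, with oversampling, can contain copies or synthetic neighbours of training points), so your scores aren't representative of model performance on real data.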
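To make the ordering of steps concrete, here is a minimal sketch (standard library only) of doing the split first and resampling only the training part of each fold. The function name `kfold_with_undersampling` is made up for illustration, the folds are plain shuffled folds rather than stratified ones, and simple random under-sampling stands in for whatever resampler you actually use; only the class counts come from the question.

```python
import random

def kfold_with_undersampling(labels, n_splits=10, seed=0):
    """Yield (train_idx, test_idx); only the training part is under-sampled."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    # Split FIRST, so every test fold keeps the original class balance.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        pos = [i for i in train_idx if labels[i] == 1]
        neg = [i for i in train_idx if labels[i] == 0]
        minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
        # Resample SECOND, and only within the training fold.
        majority_kept = rng.sample(majority, len(minority))
        yield minority + majority_kept, test_idx

# Class counts from the question: 36548 "no" (0) and 4640 "yes" (1).
labels = [0] * 36548 + [1] * 4640
splits = list(kfold_with_undersampling(labels))

train_idx, test_idx = splits[0]
train_pos_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_pos_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positive rate: {train_pos_rate:.3f}")  # balanced by construction
print(f"test positive rate:  {test_pos_rate:.3f}")   # close to the original 4640/41188
```

Done this way, the confusion matrices summed over the test folds add up to the original 41188 instances, which is a quick sanity check on any report.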

The first six score columns are all based on the confusion matrix, and so depend on a cutoff/threshold probability (presumably 0.5). Resampling the data largely just shifts the predicted probabilities, so that this cutoff is effectively rather different. For the same reason, the ROC curve is relatively unchanged, because it depends only on the ordering of the predicted probabilities. For the PR curve, resampling does have a significant effect (for the rare positive class), because precision depends on the class proportions in the evaluated data; see my answer on a DS.SE question.
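As a rough illustration of the threshold point, the sketch below assumes an idealised case in which balancing the classes does nothing except add a constant to every predicted log-odds. The scores are made up for the demonstration and are not your data; only the class counts used for the shift come from the question. Under that assumption the ranking of the instances, and hence the ROC AUC, is untouched, while recall and precision at the fixed 0.5 cutoff move.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def roc_auc(scores, labels):
    """Rank-based (Mann-Whitney) ROC AUC; assumes no tied scores."""
    ranked = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, (_, y) in enumerate(ranked, start=1) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def recall_precision(probs, labels, threshold=0.5):
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    return tp / sum(labels), tp / (tp + fp)

rng = random.Random(0)
# Made-up log-odds for illustration only (assumption, not the asker's model).
labels = [0] * 4000 + [1] * 500
logits = ([rng.gauss(-3.0, 1.5) for _ in range(4000)]
          + [rng.gauss(-0.5, 1.5) for _ in range(500)])

# Idealised effect of balancing the classes: a constant log-odds shift.
shift = math.log(36548 / 4640)  # class counts from the question

probs_raw = [sigmoid(z) for z in logits]
probs_bal = [sigmoid(z + shift) for z in logits]

auc_raw, auc_bal = roc_auc(probs_raw, labels), roc_auc(probs_bal, labels)
recall_raw, precision_raw = recall_precision(probs_raw, labels)
recall_bal, precision_bal = recall_precision(probs_bal, labels)

print(f"AUC       raw={auc_raw:.3f}  shifted={auc_bal:.3f}")
print(f"recall    raw={recall_raw:.3f}  shifted={recall_bal:.3f}")
print(f"precision raw={precision_raw:.3f}  shifted={precision_bal:.3f}")
```

In this toy setting, the shifted scores give higher recall and lower precision for the positive class at the 0.5 cutoff, with the same AUC. Moving the cutoff on the unshifted scores would produce the same trade-off without any resampling.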
