
I'm trying to build a model for the Credit Card Fraud dataset. I used a combination of undersampling and oversampling to balance the data, ran an NN model, and tuned it with keras-tuner. The best val_auc I got is around 98%, but when I evaluate on the unbalanced test data I get around 53% AUPRC.

Report:
              precision    recall  f1-score   support

   Not Fraud       1.00      0.90      0.95     56861
       Fraud       0.02      0.87      0.03       101

    accuracy                           0.90     56962
   macro avg       0.51      0.89      0.49     56962
weighted avg       1.00      0.90      0.95     56962

What can be the reason for this, and what can I do to make my model better?

This is the Kaggle notebook I'm working on, which contains the code: Kaggle Notebook

  • https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en https://stats.stackexchange.com/questions/368949/example-when-using-accuracy-as-an-outcome-measure-will-lead-to-a-wrong-conclusio – Dave Jul 02 '21 at 14:50

1 Answer


This is an example of data leakage. You perform preprocessing (which, most importantly, includes your oversampling) before the train-validation split. This means that not only is your validation data balanced (which is unrepresentative of your unbalanced test data), but it also contains samples created by interpolating training samples, so you're evaluating your validation metrics on 'blurred' training data, hence the inflated performance.

Explicitly split off your validation data before any preprocessing, and apply the same preprocessing to your validation data as to your test data.
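As a minimal sketch of that ordering, assuming scikit-learn for the split and scaling and SMOTE from imbalanced-learn for the oversampling (X and y here are just placeholders for the features and labels in your notebook):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# 1. Split first, stratifying so the rare class is present in both folds.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2. Fit preprocessing on the training fold only, then transform both folds.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)  # validation is only transformed, never fitted on

# 3. Resample the training fold only; validation keeps its natural imbalance.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```

The test set is treated exactly like the validation set here: transform it with the already-fitted scaler and never resample it.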

Additionally, resampling isn't necessarily the best way to combat class imbalance. Another technique is using class weights or sample weights: these effectively 'skew' the calculated loss so that the network learns at different rates for the different classes, i.e. smaller weight updates for the majority class and larger weight updates for the minority class. Keras accepts a class_weight dictionary in fit(), and you can achieve the same effect with sample weights by simply assigning each sample a weight based on the class it belongs to. Just some food for thought; I recommend you fix your validation split first before trying anything else.
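As a rough illustration, assuming a compiled binary Keras classifier named model and the un-resampled training fold from the sketch above (the inverse-frequency weighting is a common heuristic, not something prescribed here, and epochs/batch_size are arbitrary):

```python
import numpy as np

# Inverse-frequency weights: the rarer class gets a proportionally larger weight.
n_samples, n_classes = len(y_train), 2
class_weight = {c: n_samples / (n_classes * np.sum(y_train == c)) for c in (0, 1)}

# Option 1: class weights passed directly to fit().
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=20, batch_size=2048, class_weight=class_weight)

# Option 2: the equivalent per-sample weights.
sample_weight = np.where(y_train == 1, class_weight[1], class_weight[0])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=20, batch_size=2048, sample_weight=sample_weight)
```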

Avelina
  • As the links I just posted show, the oversampling step is not only performed at the wrong time but is not even necessary. (Particularly Frank Harrell's tweet shows when statisticians think oversampling should occur.) – Dave Jul 02 '21 at 14:52
  • @Dave Indeed, I can confirm the posts on fharrell.com are also very informative and helpful. They helped me out when I had a question about SMOTE here a few months ago, where you also helped me. It's just a shame that when I was taught ML at uni, techniques like SMOTE were portrayed as a magical fix. – Avelina Jul 02 '21 at 15:01