
I built a model for a binary target using oversampled data. The population target prevalence is 0.25; I oversampled to 0.5 by keeping the entire minority class and sampling only a portion of the majority class. I then built a precision-recall table using sklearn's precision_recall_curve:

# y is the binary target; PR_CV holds the out-of-bag predicted probabilities
from sklearn.metrics import precision_recall_curve

precision, recall, tr = precision_recall_curve(y, PR_CV)

I now want to know what precision and recall look like on the original population. I tried implementing the following, based on this article:

# odds = p / (1 - p)
odds = y.mean() / (1 - y.mean())
print('oversampled odds:', odds)

original_fraction = 0.25
original_odds = original_fraction / (1 - original_fraction)
print('original odds:', original_odds)

# scoring_odds = scoring_results / (1 - scoring_results)
scoring_results = PR_CV
print('probability to revert:', scoring_results)
scoring_odds = scoring_results / (1 - scoring_results)
print('scoring odds:', scoring_odds)

# adjusted_odds = scoring_odds * original_odds / oversampled_odds
adjusted_odds = scoring_odds * original_odds / odds
print('adjusted odds:', adjusted_odds)

# adjusted_probability = 1 / (1 + 1/adjusted_odds)
adjusted_probability = 1 / (1 + 1 / adjusted_odds)
print('adjusted probability:', adjusted_probability)

PR_CV_adj = adjusted_probability  # fed back into precision_recall_curve below

I then calculated:

precision_adj, recall_adj, tr_adj = precision_recall_curve(y, PR_CV_adj)

This returns exactly the same values as precision_recall_curve(y, PR_CV).

That fails my intuition... What is the right way to do this?
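
Presumably this is because precision_recall_curve depends only on how the scores rank against each other, and the odds rescaling is strictly monotone, so against the same y the curve cannot change. A minimal sketch with made-up labels and scores shows the effect:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_demo = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.2, 0.4, 0.5, 0.6, 0.7, 0.9])

# A prevalence correction rescales the odds by a positive constant,
# which is a strictly increasing transform of the scores.
odds = scores / (1 - scores)
adjusted = (odds / 3) / (1 + odds / 3)

p1, r1, _ = precision_recall_curve(y_demo, scores)
p2, r2, _ = precision_recall_curve(y_demo, adjusted)
print(np.allclose(p1, p2), np.allclose(r1, r2))  # True True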

The question I'm trying to answer is: if on the oversampled population I expect, at the top 5th percentile, x true positives out of X predicted positives (precision), then on the entire population what are my x2 true positives and X2 predicted positives?
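
For concreteness, this is the kind of back-of-the-envelope conversion I have in mind (my own sketch, with hypothetical numbers; it assumes only the class mix differs between the two samples):

# Sketch: assuming the majority class was downsampled uniformly at random,
# the per-class score distributions (hence TPR and FPR at any threshold)
# are the same in both samples. Recall is then unchanged, while precision
# rescales with the prevalence odds ratio.
def precision_on_population(prec_sample, sample_prev=0.5, pop_prev=0.25):
    r = (pop_prev / (1 - pop_prev)) / (sample_prev / (1 - sample_prev))  # 1/3 here
    return prec_sample * r / (prec_sample * r + (1 - prec_sample))

# e.g. 60% precision on the 50/50 sample would be about 33% at 25% prevalence:
print(precision_on_population(0.60))  # ~0.3333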

Good news! Class imbalance is not a problem, and you did not need to undersample! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jul 31 '21 at 21:12

1 Answer


To estimate the efficacy of a sampling technique, you should first split your data into train and test sets, apply the sampling only to the training data, and then estimate performance on the test set. The way you are doing it, you can leak information and get unrealistically good performance results.
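
A minimal sketch of that pipeline, using synthetic data as a stand-in for yours (the resampling happens strictly inside the training fold):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 25% positives.
X, y = make_classification(n_samples=4000, weights=[0.75], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Resample ONLY the training fold: keep every minority row and
# downsample the majority class to a 50/50 mix.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])

# Evaluate on the untouched test set, which keeps the original prevalence.
scores = model.predict_proba(X_test)[:, 1]
precision, recall, tr = precision_recall_curve(y_test, scores)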