
I built a model for a binary target using oversampled data. The population target prevalence is 0.25; I oversampled to 0.5 by keeping the entire minority class and sampling only a portion of the majority class. I then built a precision-recall table using sklearn's precision_recall_curve:

# y is the binary target; PR_CV holds the out-of-bag predicted probabilities
from sklearn.metrics import precision_recall_curve

precision, recall, tr = precision_recall_curve(y, PR_CV)

I now want to know what precision and recall look like on the original population. I tried implementing the following, based on this article:

# odds = p / (1 - p)
odds = y.mean() / (1 - y.mean())
print('oversampled odds:', odds)

original_fraction = 0.25
original_odds = original_fraction / (1 - original_fraction)
print('original odds:', original_odds)

# scoring_odds = scoring_results / (1 - scoring_results)
scoring_results = PR_CV
print('probability to revert:', scoring_results)
scoring_odds = scoring_results / (1 - scoring_results)
print('scoring odds:', scoring_odds)

# adjusted_odds = scoring_odds * original_odds / oversampled_odds
adjusted_odds = scoring_odds * original_odds / odds
print('adjusted odds:', adjusted_odds)

# adjusted_probability = 1 / (1 + 1/adjusted_odds)
adjusted_probability = 1 / (1 + 1 / adjusted_odds)
print('adjusted probability:', adjusted_probability)

PR_CV_adj = adjusted_probability  # fed back into precision_recall_curve below

I then calculated:

precision_adj, recall_adj, tr_adj = precision_recall_curve(y, PR_CV_adj)

This returns exactly the same values as precision_recall_curve(y, PR_CV).

That fails my intuition... What is the right way to do this?
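
Presumably this is because precision_recall_curve depends only on how the scores rank against each other, and the odds rescaling is strictly monotone, so against the same y the curve cannot change. A minimal sketch with made-up labels and scores shows the effect:

import numpy as np
from sklearn.metrics import precision_recall_curve

y_demo = np.array([0, 0, 1, 0, 1, 1])
scores = np.array([0.2, 0.4, 0.5, 0.6, 0.7, 0.9])

# A prevalence correction rescales the odds by a positive constant,
# which is a strictly increasing transform of the scores.
odds = scores / (1 - scores)
adjusted = (odds / 3) / (1 + odds / 3)

p1, r1, _ = precision_recall_curve(y_demo, scores)
p2, r2, _ = precision_recall_curve(y_demo, adjusted)
print(np.allclose(p1, p2), np.allclose(r1, r2))  # True True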

The question I'm trying to answer is: if on the oversampled population I expect, at the top 5th percentile, x true positives out of X predicted positives (precision), then on the entire population what are my x2 true positives and X2 predicted positives?
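
For concreteness, this is the kind of back-of-the-envelope conversion I have in mind (my own sketch, with hypothetical numbers; it assumes only the class mix differs between the two samples):

# Sketch: assuming the majority class was downsampled uniformly at random,
# the per-class score distributions (hence TPR and FPR at any threshold)
# are the same in both samples. Recall is then unchanged, while precision
# rescales with the prevalence odds ratio.
def precision_on_population(prec_sample, sample_prev=0.5, pop_prev=0.25):
    r = (pop_prev / (1 - pop_prev)) / (sample_prev / (1 - sample_prev))  # 1/3 here
    return prec_sample * r / (prec_sample * r + (1 - prec_sample))

# e.g. 60% precision on the 50/50 sample would be about 33% at 25% prevalence:
print(precision_on_population(0.60))  # ~0.3333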

Good news! Class imbalance is not a problem, and you did not need to undersample! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jul 31 '21 at 21:12

1 Answer


To estimate the efficacy of a sampling technique, you should first split your data into train and test sets, apply the sampling only to the training data, and then estimate performance on the test set. The way you are doing it, you can leak information and get unrealistically good performance results.
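
A minimal sketch of that pipeline, using synthetic data as a stand-in for yours (the resampling happens strictly inside the training fold):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 25% positives.
X, y = make_classification(n_samples=4000, weights=[0.75], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Resample ONLY the training fold: keep every minority row and
# downsample the majority class to a 50/50 mix.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_train == 1)
neg = np.flatnonzero(y_train == 0)
idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])

model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])

# Evaluate on the untouched test set, which keeps the original prevalence.
scores = model.predict_proba(X_test)[:, 1]
precision, recall, tr = precision_recall_curve(y_test, scores)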