ML: sampling imbalanced dataset leads to selection bias

When doing supervised machine learning in the health or medical domains, we often have a target class that is relatively rare (e.g., a prevalence of 1-10% of cases). There are a few techniques we can use to address this class imbalance:

  1. Researchers will often use either a screening protocol or some kind of purposive sampling approach to choose cases for hand-review that are more likely to contain examples of the target class. That way, the machine learning algorithm will have enough target-class examples to learn from.
  2. After hand-review is complete, researchers will further up-sample cases from the minority (target) class so that the training data are closer to a 50/50 split. (Or they'll down-sample the majority class, use SMOTE, etc.) A sketch of up- and down-sampling in R follows this list.
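
For concreteness, here's a minimal sketch of approach #2 in R using caret's upSample()/downSample() helpers (the themis package offers recipe steps like step_smote() for SMOTE). The data frame train_df and outcome column target are hypothetical names:

```r
library(caret)

set.seed(42)
predictors <- train_df[, setdiff(names(train_df), "target")]

# Up-sample the minority class (with replacement) to match the majority class
up_train <- upSample(x = predictors, y = train_df$target, yname = "target")

# Or randomly down-sample the majority class instead
down_train <- downSample(x = predictors, y = train_df$target, yname = "target")

table(up_train$target)    # classes are now balanced roughly 50/50
table(down_train$target)
```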

In my case, I'm going to do both #1 and #2 to deal with class imbalance. In doing so, cases in the final training set will no longer have the same characteristics (e.g., by gender, age, race/ethnicity) as the true underlying population, and the prevalence of our target class will be artificially inflated.

My question is this: by dealing with class imbalance, don't we inherently create a selection bias problem? Is there a way to manage both concerns? As an added complication, I already have access to some hand-coded data, but it covers only young people (aged 10-25). So while I'll be hand-reviewing more cases, the training dataset is already going to be heavily skewed towards young folks. It's unclear whether age is associated with the target class, but I want my classifier to perform equally well across all age groups so that I can get an accurate estimate of the true prevalence of the target class in the full population.
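
To make that last concern concrete, here's roughly how I'd check discrimination within each age band on held-out data. This is a rough sketch, assuming a hypothetical test set test_df with columns age, target (a factor), and pred_prob (the model's predicted probability):

```r
library(dplyr)
library(pROC)

# AUC within each age band; large gaps between bands would flag a problem
test_df %>%
  mutate(age_band = cut(age, breaks = c(9, 25, 45, 65, Inf),
                        labels = c("10-25", "26-45", "46-65", "66+"))) %>%
  group_by(age_band) %>%
  summarise(n = n(),
            auc = as.numeric(auc(roc(target, pred_prob, quiet = TRUE))))
```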

Thanks for taking the time to read this question!

Resources I've consulted

  • In doing research on this, I was also surprised to see that up/down-sampling will adversely impact the posterior probabilities calculated by the classifier. Yikes. I found a paper on how to correct for this under down-sampling (Dal Pozzolo et al., 2015; a sketch of the correction follows this list), but nothing for up-sampling or SMOTE. It sounds like there is some debate about whether this correction is even necessary (see the comment by Max Kuhn on the RStudio link below). Have there been any updates on this?
  • I've read about using importance weights with density ratio estimation, but it sounds like that is applied to the whole feature set, not just the key demographics of interest (a simplified, demographics-only sketch follows this list). Also, this approach is still in development. I'm an R user (no Python experience), and a newbie R user at that, so I'm looking for a solution that will be at least somewhat manageable to implement.
  • Jacobusse and Veenman (2016) discuss this but seem more preoccupied with the fact that purposive sampling (approach #1) in search of good examples of the target class might miss the boat on the comparison class entirely. While this seems important, I'm not sure it fully answers my question: cases in the training set will remain unrepresentative of the true population on key demographic characteristics like gender, age, and race/ethnicity even if we sample the non-target class better.
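
For reference, the down-sampling correction in Dal Pozzolo et al. (2015) is a one-liner: if beta is the probability that a majority-class (negative) case was kept (e.g., n_neg_kept / n_neg_total) and p_s is the probability the down-sampled model assigns to the positive class, the calibrated probability is beta * p_s / (beta * p_s - p_s + 1). A sketch (variable names are my own):

```r
# Posterior correction for random undersampling (Dal Pozzolo et al., 2015)
# beta: probability that a majority-class case was kept during down-sampling
# p_s:  predicted probability of the positive class from the model trained
#       on the down-sampled data
calibrate_undersampled <- function(p_s, beta) {
  beta * p_s / (beta * p_s - p_s + 1)
}

# Example: we kept 5% of majority-class cases and the model outputs 0.60
calibrate_undersampled(0.60, beta = 0.05)  # ~0.07 after correction
```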
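And since my worry is really just a handful of known demographics, a much simpler stand-in for full density ratio estimation might be post-stratification-style weights: weight each training case by (population share of its demographic cell) / (sample share of that cell). A rough sketch, assuming hypothetical age_band and gender columns and known population shares (e.g., from census data):

```r
library(dplyr)

# Population share of each demographic cell (toy numbers from external data)
pop <- tibble(
  age_band = c("10-25", "26-45", "10-25", "26-45"),
  gender   = c("F", "F", "M", "M"),
  pop_prop = c(0.10, 0.35, 0.12, 0.43)
)

# Weight = population share / sample share within each cell
cell_weights <- train_df %>%
  count(age_band, gender, name = "n_cell") %>%
  mutate(samp_prop = n_cell / sum(n_cell)) %>%
  left_join(pop, by = c("age_band", "gender")) %>%
  mutate(weight = pop_prop / samp_prop) %>%
  select(age_band, gender, weight)

train_df <- left_join(train_df, cell_weights, by = c("age_band", "gender"))

# train_df$weight can then be passed as case weights, e.g. via the weights
# argument of glm() or case.weights in ranger()
```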

Sources

Dal Pozzolo, A., Caelen, O., Johnson, R. A., & Bontempi, G. (2015, December). Calibrating probability with undersampling for unbalanced classification. In 2015 IEEE Symposium Series on Computational Intelligence (pp. 159-166). IEEE.

Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220-239.

Jacobusse, G., & Veenman, C. (2016, October). On selection bias with imbalanced classes. In International Conference on Discovery Science (pp. 325-340). Springer, Cham.

https://www.analyticsvidhya.com/blog/2017/07/covariate-shift-the-hidden-problem-of-real-world-data-science/

https://community.rstudio.com/t/adjusting-posterior-model-estimated-probabilities-after-re-balancing-or-applying-case-weight/8994/2

https://github.com/topepo/caret/issues/460

Comments

  • When you use proper statistical methods, class imbalance is not a problem. https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Aug 27 '21 at 20:58

0 Answers