  • I have a dataset of 11 million rows with a 1:10 ratio between the minority and majority classes.

  • To train a model, I selected all of the minority class and 1/3 of the majority class.

  • The ratio is now 3:10, and the sample comprises 4.33 million rows.

  • I have fit an XGBoost model on this under-sampled data with cross-validation, with 'ok' results on the train, test, and validation sets (all derived from the 4.33 million rows).
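For concreteness, the sampling described in the bullets above can be sketched as follows. This is a minimal toy stand-in (1,000 minority and 10,000 majority rows instead of 11 million, and a single made-up feature), not the actual dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the real table: a 1:10 minority:majority ratio.
n_minority, n_majority = 1_000, 10_000
df = pd.DataFrame({
    "feature": rng.normal(size=n_minority + n_majority),
    "label": np.r_[np.ones(n_minority), np.zeros(n_majority)].astype(int),
})

minority = df[df["label"] == 1]                  # keep every minority row
majority = df[df["label"] == 0].sample(frac=1 / 3, random_state=42)  # 1/3 of majority

# Combine and shuffle; the result is the under-sampled training table.
sample = pd.concat([minority, majority]).sample(frac=1.0, random_state=42)

print(len(sample), round(len(minority) / len(majority), 2))  # ~4,333 rows, ratio ~0.3 (3:10)
```

Scaled up by a factor of 1,000, the same arithmetic gives 1M + 10M/3 ≈ 4.33 million rows at a 3:10 ratio, matching the figures above.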

My question now is: should I also train/test the model against the full 11 million rows, or can I proceed with the model I have now?

  • Depends on what you want to learn or achieve. If you do a train/test on imbalanced data there is a risk of a biased model that appears to work well but doesn't. What is reasonable depends on whether there are special restrictions on what you can do with the data and the risk implications of under- and over-analysis. What is a reasonable strategy for assessing impact of website button color on click rate is very different from a phase 3 clinical trial. – ReneBt Sep 07 '21 at 04:20

1 Answer


After consulting some data scientists and a bit of googling, it appears there is no single standard, as @ReneBt commented.

However, it is recommended to evaluate the model against the full labelled dataset and measure the performance loss. Some loss is expected, since the under-sampled data carries much less information than its superset.

Now, whether that loss is acceptable depends on a lot of non-technical factors (again, well pointed out by @ReneBt).

A good reference, which answers common questions about "finalizing" a model: https://machinelearningmastery.com/train-final-machine-learning-model/
