Machine Learning - How to Sample Test and Training Data for Rare Events

Question

Suppose I have a data set with 1000 observations. I want to train and test a classification model to predict a target variable as true or false. However, in my observation set, true occurs only, say, 10% of the time, so I have 900 false labels and 100 true labels.

Suppose I want to split this data set into subsets for training and testing in a 70/30 ratio. What is the most appropriate approach? As I see it, I can:

(a) Simply take a random 30% for the test set. But this could possibly contain very few or no true labels; OR

(b) I can force the training and testing set to be split in a way that there is a 10% true representation in each set.

Which of these is more correct?

usεr11852 · Accepted Answer · 2019-07-30T00:45:31.860

Both approaches are equally correct.

If we want a single hold-out test sample, stratification (option (b) above) is probably more appropriate because it ensures that our test set has the same class distribution as our training set. If we instead train our classifier repeatedly (e.g. on 100 bootstrap samples), the observed variation in the class proportions will be attenuated, and this approach is probably preferable, as we train in more "realistic conditions" and allow for sampling variation more explicitly.
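As a rough illustration of option (b), here is a minimal sketch of a stratified 70/30 split using only the Python standard library. The function name `stratified_split` and its parameters are hypothetical, made up for this example; in practice, scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately
    so that both sets keep the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# The data set from the question: 100 true labels, 900 false labels.
labels = [True] * 100 + [False] * 900
train_idx, test_idx = stratified_split(labels)
n_true_train = sum(labels[i] for i in train_idx)
n_true_test = sum(labels[i] for i in test_idx)
print(len(train_idx), n_true_train)  # 700 70
print(len(test_idx), n_true_test)    # 300 30
```

Both sets end up with exactly 10% true labels, which is the guarantee a plain random 30% draw cannot give.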

We usually stratify to correct issues with our classifier's training, not because it is "more" (or "less") correct than not stratifying. For example, GLM-based routines are not strongly affected by class imbalance (see here for more information), while others (e.g. SVM-like routines) tend to be influenced more, requiring some action on our part (e.g. reweighting the training sample in the case of SVMs). CV.SE has a great thread on the matter, "When is unbalanced data really a problem in Machine Learning?", which I would urge you to read.
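The answer does not say which reweighting scheme to use, so the following is only a sketch of one common heuristic: inverse-frequency class weights, computed as n_samples / (n_classes * class_count). This is the same formula scikit-learn documents for `class_weight="balanced"`. The helper name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# The data set from the question: 100 true labels, 900 false labels.
labels = [True] * 100 + [False] * 900
weights = balanced_class_weights(labels)
print(weights[True])   # 5.0
print(weights[False])  # roughly 0.556
```

With these weights, each class contributes the same total weight (500) to the training loss, so the rare class is no longer dominated by sheer count.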

@user11852 thanks for your answer and the link which you posted. Both are helpful — Fritz45, Jul 30 '19 at 08:06

score 1 · Answer 2 · answered Jul 31 '19 at 21:17

Further to the answer from user11852 above: I also recently learned from Kubat's excellent book "An Introduction to Machine Learning" that the scenario described in the question is called Imbalanced Training Sets. There are two ways in which this can typically be handled:

1. Majority-Class Undersampling ("the mechanical approach"). In cases where we are mainly interested in modelling the phenomenon in the under-represented class (e.g. oil spills vs non-oil spills), we can deliberately under-sample the over-represented class, and this could improve our model depending on our objectives.
2. Oversampling the Minority Class. This applies if the training set is so small that any reduction of the over-represented class is impractical. In this case, rather than removing majority class observations, we add examples of the minority class. According to Kubat's book, this can be done by simply adding copies of the minority class observations, or by creating slightly modified versions of them.

I am sure option 2 needs to be approached with care. Details are in the book. I hope this helps someone.
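The two resampling approaches above can be sketched in a few lines of standard-library Python. This is an illustration under my own assumptions, not Kubat's procedure: the function names are hypothetical, the target is a 1:1 class ratio, and oversampling is done by plain duplication only (no modified copies). I have applied it to the 70% training portion (630 false, 70 true) on the assumption that the test set is left untouched.

```python
import random

def undersample_majority(majority, minority, seed=0):
    """Keep a random subset of the majority class, equal in size to the minority."""
    rng = random.Random(seed)
    n_keep = min(len(majority), len(minority))
    return rng.sample(majority, n_keep) + list(minority)

def oversample_minority(majority, minority, seed=0):
    """Add random copies of minority observations until the classes are equal in size."""
    rng = random.Random(seed)
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority) + list(minority) + extra

# Training portion only: 630 false and 70 true observations, as (id, label) pairs.
majority = [(i, False) for i in range(630)]
minority = [(i, True) for i in range(630, 700)]

under = undersample_majority(majority, minority)
over = oversample_minority(majority, minority)
print(len(under), sum(y for _, y in under))  # 140 70
print(len(over), sum(y for _, y in over))    # 1260 630
```

Note the trade-off: undersampling throws away 560 of the 630 majority observations, while oversampling adds no new information and only repeats the same 70 minority observations.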

score 1 · Answer 3 · answered Aug 01 '19 at 01:07

To add to the two great answers given above, I would also extend this to cross-validation.

If you sample 70/30 for training and hold-out and want to do cross-validation on the 70% training set, keep in mind that the more folds you use in the cross-validation, the fewer rare events there are in each fold and the larger the chance that one of the folds contains none of them.

If you have a perfectly stratified sample (700/300 with 10% positives in each), the training set holds 70 positives, so a 10-fold CV expects on average 7 positive cases per fold, with a non-zero chance that a fold contains only one or none.
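To put numbers on that, here is a small sketch that computes the expected count and the chance of a fold with at most one positive. It assumes the folds are formed by a plain random partition (not stratified folds), so the number of positives in any one given fold follows a hypergeometric distribution. The variable and function names are my own.

```python
from math import comb

n_train, n_pos, k_folds = 700, 70, 10
fold_size = n_train // k_folds  # 70 observations per fold

def prob_positives_in_fold(j):
    """P(a given fold holds exactly j positives) under random partitioning."""
    return (comb(n_pos, j) * comb(n_train - n_pos, fold_size - j)
            / comb(n_train, fold_size))

expected = fold_size * n_pos / n_train
p_zero = prob_positives_in_fold(0)
p_at_most_one = p_zero + prob_positives_in_fold(1)

print(f"expected positives per fold: {expected}")
print(f"P(no positives in a given fold): {p_zero:.2e}")
print(f"P(at most one positive in a given fold): {p_at_most_one:.2e}")
```

These probabilities are small for this setup but grow quickly with more folds or a rarer positive class. Stratifying the folds as well removes the problem, since each fold then gets its share of positives by construction.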

Hope this is useful!
