Suppose I have a data set with 1000 observations. I want to train and test a classification model that predicts a target variable as true or false. However, in my data, true occurs only about 10% of the time, so I have 900 false labels and 100 true labels.
I want to split this data set into training and test subsets in a 70/30 ratio. What is the most appropriate approach? As I see it, I can:
(a) Simply take a random 30% of the observations as the test set. But by chance this could contain very few true labels, or even none; OR
(b) Force the split so that the training set and the test set each contain 10% true labels, matching the full data set (as far as I know, this is called a stratified split).
Which of these is more correct?
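
To make the two options concrete, here is a minimal sketch in plain Python of what I mean. The function names random_split and stratified_split, the seed value, and the synthetic label list are my own placeholders for illustration, not part of any library. As far as I know, scikit-learn's train_test_split offers option (b) through its stratify argument, but the sketch below does not depend on it.

```python
import random


def random_split(labels, test_frac, seed):
    """Option (a): choose the test indices uniformly at random, ignoring labels."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = round(len(labels) * test_frac)
    return indices[n_test:], indices[:n_test]  # (train, test)


def stratified_split(labels, test_frac, seed):
    """Option (b): split each class separately so both sets keep the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Synthetic labels matching the question: 100 true, 900 false.
labels = [True] * 100 + [False] * 900

train_a, test_a = random_split(labels, 0.3, seed=0)
train_b, test_b = stratified_split(labels, 0.3, seed=0)

# Count the true labels that ended up in each test set.
true_in_test_a = sum(labels[i] for i in test_a)
true_in_test_b = sum(labels[i] for i in test_b)

print("option (a): true labels in test set =", true_in_test_a)
print("option (b): true labels in test set =", true_in_test_b)
```

With option (b) the test set always gets exactly 30 of the 100 true labels, while with option (a) the count depends on the random seed.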