I want to predict whether a loan is default or fully paid, with about 20 features and 10,000 historical observations.
Among the data over 85% are fully paid, 15% are default, I want to try classification tree, but it won't split. Do I have to balance the outcome first? That is to say, I randomly sample 1500 out of 8500 fully paid obs and combine with the 1500 default obs, then I continue sample the 80% of the 3000 obs to be the training set, the rest 20% to be the test set?