Prediction problem: Do I have to sample the data set so that the outcomes are balanced?

Question

I want to predict whether a loan will default or be fully paid, using about 20 features and 10,000 historical observations.

Of these observations, about 85% are fully paid and 15% are defaults. I want to try a classification tree, but it won't split. Do I have to balance the outcome first? That is, should I randomly sample 1,500 of the 8,500 fully paid observations, combine them with the 1,500 default observations, and then take 80% of the resulting 3,000 observations as the training set and the remaining 20% as the test set?

Short answer is no you shouldn't do this. Also, how many distinct "types" can you make with those $20$ features? That is, what is the maximum number of terminal nodes a tree could possibly have for this data set? — probabilityislogic, Jun 07 '14 at 07:48

score 2 · Answer 1 · edited Apr 13 '17 at 12:44

If the dataset you have is representative of the real distribution of the class labels, then the fact that the labels are imbalanced should be incorporated in your predictions. Also, discarding data that you already have is inadvisable. So instead of undersampling the majority class, one solution is to try oversampling the minority class. In classifiers like SVMs the solution is straightforward: assign a different weight to each class label. In Bayesian approaches you can use a different prior per class.
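As a rough illustration of the oversampling idea, here is a minimal sketch in plain Python. The function name `split_then_oversample`, the toy rows and the 80/20 split are assumptions made for the example, not something from the question. It holds out the test set first, so that duplicated minority rows cannot leak into it, and then resamples the minority class with replacement in the training set only.

```python
import random

def split_then_oversample(rows, label_of, minority, test_fraction=0.2, seed=0):
    """Hold out a test set, then oversample the minority class in training only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)

    # The test set keeps the original (imbalanced) class distribution.
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    minority_rows = [r for r in train if label_of(r) == minority]
    majority_rows = [r for r in train if label_of(r) != minority]

    # Draw minority rows with replacement until both classes have equal counts.
    shortfall = max(0, len(majority_rows) - len(minority_rows))
    extra = rng.choices(minority_rows, k=shortfall) if minority_rows else []

    balanced_train = train + extra
    rng.shuffle(balanced_train)
    return balanced_train, test

# Toy data mirroring the question: 1,500 defaults and 8,500 fully paid loans.
rows = [(i, "default" if i < 1500 else "paid") for i in range(10_000)]
train, test = split_then_oversample(rows, lambda r: r[1], "default")

n_default = sum(1 for r in train if r[1] == "default")
n_paid = sum(1 for r in train if r[1] == "paid")
print(len(test), n_default, n_paid)
```

The test set is left untouched so that any metric computed on it reflects the real 85/15 mix.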

Also, check out the answers here: Training a decision tree against unbalanced data

score 1 · Accepted Answer · answered Jun 07 '14 at 02:22

If you would like to survey the literature, your problem is often referred to as "class imbalance". The main danger you face is that your model can claim 85% accuracy by always guessing "fully paid", and you should dissuade it from doing this.
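To make that danger concrete, the following sketch scores a "model" that always predicts "fully paid" on data with the question's 85/15 split. The helper `evaluate` and the label strings are illustrative names chosen for this example.

```python
def evaluate(y_true, y_pred, positive):
    """Return accuracy, precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return accuracy, precision, recall

# 1,500 defaults and 8,500 fully paid loans, as in the question.
y_true = ["default"] * 1500 + ["paid"] * 8500
# A trivial model that always guesses the majority class.
y_pred = ["paid"] * len(y_true)

accuracy, precision, recall = evaluate(y_true, y_pred, positive="default")
print(accuracy, precision, recall)
```

Accuracy comes out at 0.85, while recall on the default class is 0 and precision is undefined: the model never identifies a single default.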

iliasfl suggested oversampling the minority class, which is something you should certainly try. However, if the minority class does not contain enough information to characterize it, oversampling may not help. You should try it first and see. If that fails, modifying the cost to reflect the expected proportion is also a common approach, and it is recommended within iliasfl's link. If you are using decision tree software, there should be a way to do this. Again, though, if there is insufficient information in the minority class, you might still get an unsatisfying number of false positives and false negatives.
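One way to picture what reweighting does inside a tree is to compute a class-weighted Gini impurity at the root node. This is only a sketch: the weights use the common heuristic n / (k * count) for n samples and k classes, and the function names are made up for the example. Actual tree software exposes its own option for class weights or misclassification costs.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n / (k * count), so all classes carry equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}

def weighted_gini(labels, weights):
    """Gini impurity where each sample counts with the weight of its class."""
    counts = Counter(labels)
    mass = {cls: count * weights[cls] for cls, count in counts.items()}
    total = sum(mass.values())
    return 1.0 - sum((m / total) ** 2 for m in mass.values())

labels = ["paid"] * 8500 + ["default"] * 1500

unweighted = weighted_gini(labels, {"paid": 1.0, "default": 1.0})
weighted = weighted_gini(labels, balanced_class_weights(labels))
print(unweighted, weighted)
```

Unweighted, the root node already looks fairly pure (impurity about 0.255), so the gain available from a split is small. With balanced weights the impurity is 0.5, as it would be for a 50/50 node, and the defaults have as much influence on the split criterion as the fully paid loans.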

If you are still unhappy with your results, you have other options. If there is insufficient information in the minority class, it is possible that the class is too heterogeneous and would need more data to represent it than your dataset contains. If that were the case, I would consider a one-class classifier. I've never tried a one-class decision tree (see here for a list), but I've had good results with the one-class classifier in LibSVM.

answered Jun 07 '14 at 02:22

Meadowlark Bradsher


What is wrong with getting 85% accuracy? Why is that "dangerous"? – probabilityislogic Jun 07 '14 at 07:52
It's "dangerous" because it is misleading. An accuracy of 85% sounds pretty awesome, but it is exactly what you get by always predicting the majority class. You need a combination of measures like precision/recall, or specificity/sensitivity, to avoid this trap. – iliasfl Jun 07 '14 at 12:29
@iliasfl - how is it misleading? If anything, this suggests that any method supposed to be better should get more than 85% accuracy. Always predicting the majority class is not always a bad idea. For example if the prevalence was 99% then it would be hard to do better than predicting the majority class. – probabilityislogic Jun 11 '14 at 12:46
"Always predicting the majority class is not always a bad idea": maybe, but it is definitely not a machine learning system. If 99% is the performance you get from a trivial model (it's your baseline), then you can definitely do better than 99% accuracy with a system that actually learns from the data. Depending on the problem and the volume of the data, 99% can be pretty low in terms of accuracy. Unless you are OK with 1/100 credit card transactions being declined because of a bad fraud detection system, or with losing 1/100 emails due to a bad spam filter, etc. – iliasfl Jun 11 '14 at 13:15
In the fraud detection example, if [1/1200 credit card transactions are fraudulent](http://en.wikipedia.org/wiki/Credit_card_fraud) and you trained a classifier without addressing class imbalance, your fraud detection model would almost certainly reach 99.91666...% (1199/1200) accuracy by claiming all transactions were legitimate. How useful is a fraud detection model that claims all transactions are legitimate, even the fraudulent ones? – Meadowlark Bradsher Jun 11 '14 at 17:03
I was merely pointing out that no-one is being misled by claiming 85% accuracy for predicting the majority class. For something like fraud - there is clearly a loss function involved, that is not being captured by % accuracy. It is not the imbalance of the prevalence of each class that's the problem, it's the imbalance of the consequences of making different mistakes (falsely declaring fraud is less costly than falsely declaring legit), as well as correct decisions (correct fraud is much more beneficial than correct legit). – probabilityislogic Jun 12 '14 at 15:06
Subsampling is basically a poor substitute for a proper application of decision theory - which is what fraud detection and credit risk are really about. – probabilityislogic Jun 12 '14 at 15:10
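The decision-theoretic view in the last two comments can be sketched as a rule that minimizes expected cost given a predicted default probability. The cost values below are hypothetical, and the sketch assumes correct decisions cost nothing, which simplifies the point about correct decisions also having different benefits.

```python
def reject_threshold(cost_false_alarm, cost_missed_default):
    """Probability of default above which rejecting has lower expected cost."""
    return cost_false_alarm / (cost_false_alarm + cost_missed_default)

def decide(p_default, cost_false_alarm, cost_missed_default):
    """Pick the action with the lower expected cost for one loan."""
    # Approving costs us only if the loan defaults.
    expected_cost_approve = p_default * cost_missed_default
    # Rejecting costs us only if the loan would have been repaid.
    expected_cost_reject = (1.0 - p_default) * cost_false_alarm
    return "reject" if expected_cost_reject < expected_cost_approve else "approve"

# Hypothetical costs: a missed default is five times as costly as a false alarm.
COST_FALSE_ALARM = 1.0
COST_MISSED_DEFAULT = 5.0

threshold = reject_threshold(COST_FALSE_ALARM, COST_MISSED_DEFAULT)
decision_low = decide(0.10, COST_FALSE_ALARM, COST_MISSED_DEFAULT)
decision_high = decide(0.20, COST_FALSE_ALARM, COST_MISSED_DEFAULT)
print(threshold, decision_low, decision_high)
```

With these costs the cutoff is 1/6 rather than 0.5, so a loan with a 20% predicted default probability is rejected even though "fully paid" is still the more likely outcome. The class imbalance is handled by the loss function, without resampling the data.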
