CART (rpart) balanced vs. unbalanced dataset

Question

I am fitting a classification tree (CART) to the olives dataset. The training data has 436 observations (test data: 136). The response (the 'Region' variable) has 3 classes, which split the training data into 116 / 74 / 246 observations.

If I plot the variables eicosenoic and linoleic, I can see an almost perfect classification.

I also fitted the tree to a balanced dataset with 74 observations per class (btw, is that correct, or should I use fewer than 74 observations?) and got almost the same predictions on the test data as with the unbalanced dataset.

That is why I am wondering whether a balanced dataset is required in this case. I assume that balancing is not required, but I am not sure and would like to hear other opinions.

By the way, with random oversampling I get approximately the same confusion matrix and the same test error in all 3 cases (1. balancing with undersampling, 2. balancing with oversampling, 3. no balancing). I think this confirms my assumption that if the classes are perfectly separated (as in the iris data, or the olives data here), there is no need to balance. I still hope to find agreement :). But I would also appreciate disagreement (especially with an explanation of why balancing would still be necessary). – Giuseppe, Aug 23 '12 at 21:30
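For reference, the two balancing schemes compared above can be sketched as follows. This is only an illustration in plain Python, not the code used in the question: the rows are placeholders that carry just a row id and a class label, the class labels 1/2/3 and the helper names are made up, and only the class counts (116 / 74 / 246) are taken from the question.

```python
import random
from collections import Counter


def group_by_class(rows, label_of):
    """Collect rows into one list per class label."""
    groups = {}
    for row in rows:
        groups.setdefault(label_of(row), []).append(row)
    return groups


def undersample(rows, label_of, rng):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    groups = group_by_class(rows, label_of)
    n = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, n))  # sampling without replacement
    return balanced


def oversample(rows, label_of, rng):
    """Randomly duplicate rows so every class reaches the size of the largest class."""
    groups = group_by_class(rows, label_of)
    n = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        extra = rng.choices(group, k=n - len(group))  # sampling with replacement
        balanced.extend(group + extra)
    return balanced


def label_of(row):
    return row[1]


# Placeholder training rows: (row id, class label), class counts as in the question.
rows = list(enumerate([1] * 116 + [2] * 74 + [3] * 246))
rng = random.Random(0)  # fixed seed so the resampling is reproducible

under = undersample(rows, label_of, rng)
over = oversample(rows, label_of, rng)
under_counts = Counter(label_of(r) for r in under)
over_counts = Counter(label_of(r) for r in over)
print(under_counts, over_counts)
```

Undersampling leaves 74 rows per class (222 in total) and discards the rest, while oversampling keeps all 436 original rows and adds duplicates until each class has 246.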

score 3 · Accepted Answer · edited Aug 24 '12 at 07:52

If the classes are well separated in the feature space, it will make little difference to the predictions on the test data whether the training data set is balanced or unbalanced, as long as you have enough data to identify the classes reasonably well.

If the class distributions of the features overlap considerably, it's a different story. The right thing to do then depends on your loss function and on the class distribution in the future samples that you want to predict.

If the class distribution in future samples is approximately 0.27 / 0.17 / 0.56, as in the training data (116 / 74 / 246 out of 436), and you use the 0-1 loss function to count the number of misclassifications, you will in general get a smaller number of misclassifications if you keep the training data unbalanced.
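To make this concrete, here is a small sketch under strong simplifying assumptions that are not part of the olives problem: one feature, two classes instead of three, normal class distributions with equal variance, and made-up future class proportions of 0.2 / 0.8. In that setting the cut point that minimises 0-1 loss has a closed form, so the rule a classifier would ideally learn from unbalanced data (true proportions) can be compared with the rule it would learn after balancing (proportions 0.5 / 0.5). All function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def threshold(mu0, mu1, sigma, p0, p1):
    """Cut point minimising 0-1 loss for class 0 ~ N(mu0, sigma^2) with prior p0
    and class 1 ~ N(mu1, sigma^2) with prior p1, assuming mu0 < mu1."""
    return (mu0 + mu1) / 2.0 + sigma ** 2 * math.log(p0 / p1) / (mu1 - mu0)


def error_rate(t, mu0, mu1, sigma, p0, p1):
    """Expected misclassification rate of the rule 'predict class 1 if x > t'."""
    miss0 = 1.0 - normal_cdf((t - mu0) / sigma)  # class 0 cases above the cut
    miss1 = normal_cdf((t - mu1) / sigma)        # class 1 cases below the cut
    return p0 * miss0 + p1 * miss1


P0, P1 = 0.2, 0.8  # assumed class proportions in future samples (made up)
SIGMA = 1.0

results = {}
for name, mu1 in [("overlapping", 1.5), ("well separated", 8.0)]:
    t_unbalanced = threshold(0.0, mu1, SIGMA, P0, P1)  # training keeps true proportions
    t_balanced = threshold(0.0, mu1, SIGMA, 0.5, 0.5)  # training data balanced
    results[name] = (
        error_rate(t_unbalanced, 0.0, mu1, SIGMA, P0, P1),
        error_rate(t_balanced, 0.0, mu1, SIGMA, P0, P1),
    )
    print(name, results[name])
```

With overlapping classes the rule based on the true proportions should come out at an error rate of roughly 0.15 versus roughly 0.23 for the balanced rule; with well separated classes both error rates are practically zero, so balancing makes no visible difference.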

As a general comment, I would always avoid actually throwing away data unless the training data set is huge. If you expect future samples to have a class distribution that differs from that of the training data, I would try to incorporate that in the model instead. In a classification tree, that could be done by weighting. If you use (naive) Bayes, you can simply change the prior class probabilities.
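A minimal sketch of the weighting idea, in plain Python rather than rpart: each class gets a case weight equal to its expected future proportion divided by its training proportion, so the weighted class proportions match the future distribution without dropping any rows. The class names "A"/"B"/"C", the uniform target distribution and both helper functions are assumptions made for illustration; only the class counts come from the question. As far as I know, rpart exposes the same idea through its `weights` argument and through `parms = list(prior = ...)`, though its internals need not match this sketch exactly.

```python
from collections import Counter


def class_weights(labels, target_prior):
    """Per-class case weights that make the weighted class proportions equal target_prior."""
    counts = Counter(labels)
    n = len(labels)
    return {k: target_prior[k] / (counts[k] / n) for k in counts}


def weighted_gini(labels, weights):
    """Gini impurity of a node when every row counts with the weight of its class."""
    mass = Counter()
    for y in labels:
        mass[y] += weights[y]
    total = sum(mass.values())
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


# Root node of the tree: all training labels, class counts as in the question.
labels = ["A"] * 116 + ["B"] * 74 + ["C"] * 246

# Assumed future class distribution (made up): all three classes equally likely.
target = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}

w = class_weights(labels, target)
g_plain = weighted_gini(labels, {k: 1.0 for k in target})  # unweighted impurity
g_weighted = weighted_gini(labels, w)                      # impurity under the target prior
print(w, g_plain, g_weighted)
```

The rarest class receives the largest weight, and the weighted root impurity equals the impurity of a perfectly balanced node (2/3 for three classes), which is the effect undersampling aims for while still using all 436 rows.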

score 0 · Answer 2 · edited Apr 13 '17 at 12:44

I have offered a related answer under the post 'cart Training a decision tree against unbalanced data'

– rf7, answered Apr 07 '17 at 06:36 (last edited by Community)

Thank you for your answer. Maybe you rephrase your answer from there or we can convert it to a comment – Ferdi Apr 07 '17 at 07:09
At least include a proper link to the other post. – mdewey Apr 07 '17 at 08:06
