I have a question about decision trees and unbalanced data. My dependent variable is binary (0 or 1), and the positive class (1) accounts for only about 2% of the dataset. Note that I'm interested both in identifying the relevant variables and in predictive power. Here are the steps I follow:
1) I balance the training data to a 50/50 split (as advised in this post: Training a decision tree against unbalanced data)
2) I fit the tree in R on the balanced training set
3) I predict on the (still unbalanced) validation set
4) I compute the balanced accuracy (again, as advised in the post above; see the R sketch after this list)
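For reference, here is a minimal sketch of those four steps in R. I'm assuming `rpart` for the tree and `caret` for the downsampling and the confusion matrix; the data frame names `train`, `valid` and the outcome column `y` are placeholders, not necessarily what you used:

```r
library(rpart)
library(caret)

set.seed(42)

# 1) Balance the training data to a 50/50 split by downsampling the majority class
train_bal <- downSample(x = train[, setdiff(names(train), "y")],
                        y = factor(train$y), yname = "y")

# 2) Fit a classification tree on the balanced training set
fit <- rpart(y ~ ., data = train_bal, method = "class")

# 3) Predict classes on the untouched (still unbalanced) validation set
pred <- predict(fit, newdata = valid, type = "class")

# 4) The confusion matrix reports Accuracy, the No Information Rate and Balanced Accuracy
confusionMatrix(pred, factor(valid$y), positive = "1")
```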
This is where I'm confused: my balanced accuracy is 0.75 (versus an overall accuracy of 0.65), but both are still way below the no information rate of 0.98. Does this mean my model is bad? What should I do, or compare my results to, at this point?
EDIT: I think I have to compare the balanced accuracy of my model to the balanced accuracy of the "no-information" model, which is always 0.5: the formula is 0.5*TP/(TP+FN) + 0.5*TN/(TN+FP), so if you classify everything as positive (or everything as negative), one term equals 0.5 and the other equals 0, giving 0.5 either way.
Let me know if I'm mistaken.
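As a quick sanity check of that EDIT (a toy example matching the 2% prevalence, not your actual data), the majority-class model gets plain accuracy 0.98 (the no information rate) but balanced accuracy exactly 0.5:

```r
# Toy example: 2% positives, and a "model" that predicts 0 for everything
truth <- factor(c(rep(0, 980), rep(1, 20)), levels = c(0, 1))
pred  <- factor(rep(0, 1000), levels = c(0, 1))

sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # TP / (TP + FN) = 0
spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)  # TN / (TN + FP) = 1

mean(pred == truth)       # plain accuracy: 0.98 (the no information rate)
0.5 * sens + 0.5 * spec   # balanced accuracy: 0.5
```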