I have a question about decision trees and unbalanced data. My dependent variable is binary (0 or 1), and the positive class (1) accounts for only about 2% of the dataset. Note that I'm interested both in identifying the relevant variables and in predictive power. Here are the steps I follow:
1) I balance the training data to a 50/50 split (as advised in this post: Training a decision tree against unbalanced data)
2) I fit the tree in R on the balanced training set
3) I predict on the (still unbalanced) validation set
4) I compute the balanced accuracy (again, as advised in the post above; see the R sketch after this list)
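For reference, here is a minimal sketch of those four steps in R. I'm assuming `rpart` for the tree and `caret` for the downsampling and the confusion matrix; the data frame names `train`, `valid` and the outcome column `y` are placeholders, not necessarily what you used:

```r
library(rpart)
library(caret)

set.seed(42)

# 1) Balance the training data to a 50/50 split by downsampling the majority class
train_bal <- downSample(x = train[, setdiff(names(train), "y")],
                        y = factor(train$y), yname = "y")

# 2) Fit a classification tree on the balanced training set
fit <- rpart(y ~ ., data = train_bal, method = "class")

# 3) Predict classes on the untouched (still unbalanced) validation set
pred <- predict(fit, newdata = valid, type = "class")

# 4) The confusion matrix reports Accuracy, the No Information Rate and Balanced Accuracy
confusionMatrix(pred, factor(valid$y), positive = "1")
```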
This is where I'm confused: my balanced accuracy is 0.75 (versus an overall accuracy of 0.65), but both are still way below the no information rate of 0.98. Does this mean my model is bad? What should I do, or compare my results to, at this point?
EDIT: I think I have to compare the balanced accuracy of my model to the balanced accuracy of the "no-information" model, which is always 0.5: the formula is 0.5*TP/(TP+FN) + 0.5*TN/(TN+FP), so if you classify everything as positive (or everything as negative), one term equals 0.5 and the other equals 0, giving 0.5 either way.
Let me know if I'm mistaken.
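As a quick sanity check of that EDIT (a toy example matching the 2% prevalence, not your actual data), the majority-class model gets plain accuracy 0.98 (the no information rate) but balanced accuracy exactly 0.5:

```r
# Toy example: 2% positives, and a "model" that predicts 0 for everything
truth <- factor(c(rep(0, 980), rep(1, 20)), levels = c(0, 1))
pred  <- factor(rep(0, 1000), levels = c(0, 1))

sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # TP / (TP + FN) = 0
spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)  # TN / (TN + FP) = 1

mean(pred == truth)       # plain accuracy: 0.98 (the no information rate)
0.5 * sens + 0.5 * spec   # balanced accuracy: 0.5
```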