
How do I prune a decision tree classifier? For a decision tree regressor, we usually look at the MSE to choose the max depth, but what is the equivalent for a classifier? I have been using the confusion matrix and the prediction accuracy score to evaluate the model at each depth, but the model continues to have a high false-negative rate. How else can I prune the model?

Thank you and I wish you good health!

Edward Lam

2 Answers


Accuracy is an improper scoring rule. See e.g., https://stats.stackexchange.com/a/359936/173546.

The problem of imbalance can be circumvented by using a proper scoring rule. For example, prune based on the squared error loss on predicted probabilities (a.k.a. the Brier score). This is in fact quite similar to pruning a regression tree based on mean squared error. It does not make a lot of sense to me to grow a tree by minimizing the cross-entropy or Gini index (proper scoring rules) and then prune it based on misclassification rates.
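As a rough illustration of pruning on the Brier score, here is a minimal sketch using scikit-learn. The synthetic data, the use of cost-complexity pruning (`ccp_alpha`) as the pruning mechanism, and the helper name `mean_brier` are my own assumptions for the example, not part of the question; sweeping `max_depth` instead would follow the same pattern.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 10% positives), for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Candidate pruning strengths taken from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))


def mean_brier(alpha):
    """Cross-validated Brier score (lower is better) for one ccp_alpha."""
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    # scikit-learn reports the negated Brier score so that higher is better.
    scores = cross_val_score(tree, X, y, cv=5, scoring="neg_brier_score")
    return -scores.mean()


brier_by_alpha = {float(a): mean_brier(a) for a in alphas}

# Keep the pruning strength with the lowest cross-validated Brier score.
best_alpha = min(brier_by_alpha, key=brier_by_alpha.get)
pruned_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)

print("best ccp_alpha:", best_alpha)
print("cross-validated Brier score:", brier_by_alpha[best_alpha])
print("leaves in pruned tree:", pruned_tree.get_n_leaves())
```

The selection is made on predicted probabilities rather than hard class labels, so it does not depend on a classification threshold.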

Marjolein Fokkema

You can use any metric you want. The best metric to use depends on the data you have. You can consider using the F1 score. Depending on the averaging technique you use, you can nudge things towards reducing false negatives.

If you are using Python/sklearn, you can pick your averaging method via the `average` argument of `f1_score` (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
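To see how the averaging choice changes the number you get, here is a small sketch. The labels are made up for illustration (class 1 is the rare positive class, with one false negative).

```python
from sklearn.metrics import f1_score

# Hypothetical labels: 8 negatives, 2 positives, one positive is missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Compute F1 under each averaging method.
scores = {
    average: f1_score(y_true, y_pred, average=average)
    for average in ("binary", "macro", "weighted", "micro")
}

for average, score in scores.items():
    print(f"{average:>8}: {score:.3f}")
```

With `average="binary"` only the positive class counts, so the missed positive is penalized most heavily; `"micro"` coincides with plain accuracy here and hides it the most.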

roundsquare
  • Thank you very much. I used grid search CV to look for the best parameters for my decision tree classifier and used weighted scoring as the scoring method because my dataset is unbalanced: there is a large number of false cases in the label. Did I use the right scoring method here (for this reason)? – Edward Lam Apr 01 '20 at 00:28
  • Sorry for the late reply. I assume you mean you used the $F_\beta$ score? If so, that is an appropriate way to do this. What I often do is decide what my "goal" is in terms of the confusion matrix and then select a value for $\beta$ and a target score for $F_\beta$. If you don't set your goal unrealistically high, this works pretty well. – roundsquare Aug 25 '21 at 12:21
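
A minimal sketch of the $F_\beta$ approach with a grid search, assuming scikit-learn. The synthetic data, the choice $\beta = 2$, and the `max_depth` grid are illustrative assumptions, not values from this thread.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (roughly 10% positives), for illustration only.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# beta > 1 weights recall more than precision, so false negatives cost more.
# zero_division=0 scores a fold as 0 if the tree predicts no positives at all.
f2_scorer = make_scorer(fbeta_score, beta=2, zero_division=0)

depth_grid = [2, 3, 4, 5, 6, 8, 10]
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depth_grid},
    scoring=f2_scorer,
    cv=5,
)
search.fit(X, y)

print("best max_depth:", search.best_params_["max_depth"])
print("cross-validated F2:", search.best_score_)
```

Comparing `search.best_score_` with the target you set for $F_\beta$ tells you whether the goal is realistic for this model.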