
I am trying to use an ensemble classifier (homing in on MATLAB's fitcensemble). I've also explored using a single decision tree as well as tree bagging (MATLAB fitctree, TreeBagger).

This is simple binary (A/B) classification. My training dataset is imbalanced (~5% B). I'm currently playing with 24 features, but that could change. I have ~1.7 million rows, and could have 10x that number if I wanted.

I am trying to decide on the parameters related to tree size: 'MinLeafSize' and 'MaxNumSplits'. I'm exploring the ('OptimizeHyperparameters','all') option, which tries a variety of settings, but it is just confirming what I already figured out: a large number of splits gives better performance. (It seems to be homing in on the AdaBoostM1 method.)
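For reference, my call looks roughly like this (a sketch; `X` and `Y` stand in for my predictor matrix and labels):

```matlab
% X: ~1.7M-by-24 predictor matrix; Y: class labels ('A'/'B', ~5% B)
% Let Bayesian optimization search over all eligible hyperparameters,
% including Method, MinLeafSize, and MaxNumSplits.
Mdl = fitcensemble(X, Y, ...
    'OptimizeHyperparameters', 'all', ...
    'HyperparameterOptimizationOptions', struct('ShowPlots', true));
```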

When my maximum number of splits is ~1/2 my number of rows, I get:

- Sensitivity (for B): 98% (train), 86% (test)
- Specificity (for A): 99% (train), 93% (test)

Really I want better than 86% sensitivity; however, my gut tells me that having the number of splits be ~1/2 the number of rows is too high.

The asker in this question states they set their maximum number of splits to 1/16th the number of rows in their dataset:

Balance classifier performance (boosting ensemble)

That sounds like they are calling upon a rule of thumb (which I value), but it's just one example. Can anyone provide guidance or a reference for how many splits is reasonable, beyond simply trying many options and seeing what works?
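For concreteness, that 1/16 rule of thumb would translate to something like the following (a sketch; the NumLearningCycles value is just a placeholder, not from the linked question):

```matlab
N = size(X, 1);                          % ~1.7 million rows
t = templateTree('MaxNumSplits', round(N/16));  % cap tree depth per learner
Mdl = fitcensemble(X, Y, ...
    'Method', 'AdaBoostM1', ...
    'Learners', t, ...
    'NumLearningCycles', 100);
```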

bliswell