
I have a training set and a validation set, and I want to use them to optimize one hyperparameter of a binary-classification random forest (scikit-learn): the number of trees. (As Sycorax explained in the comments, accuracy only increases with forest size; I would still like to compare accuracy across sizes so I can trade runtime off against accuracy.)

I want to plot the area under the ROC curve on the training and validation sets for each number of trees considered, to see how the classifier improves with increasing forest size.

Does it make a difference whether I:

  1. Train a complete forest with 100 trees, classify the validation samples starting with the full forest, and then repeat the classification after removing one random tree from the forest in each repetition (sketched in code below), or
  2. Completely retrain an entire new forest for every number of trees considered?

The first method is obviously much faster, but does it somehow introduce errors or biases, or otherwise invalidate the analysis?
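To be concrete, the first method would look roughly like the sketch below (X_train, y_train, X_val, y_val are placeholders for my data; I take prefixes of the fitted tree list rather than removing random trees, which should be equivalent since the trees are exchangeable):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

    aucs = []
    for n in range(1, len(rf.estimators_) + 1):
        # Average the per-tree class-1 probabilities of the first n trees;
        # this mirrors how the full forest's predict_proba aggregates trees.
        proba = np.mean([t.predict_proba(X_val)[:, 1] for t in rf.estimators_[:n]], axis=0)
        aucs.append(roc_auc_score(y_val, proba))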

In case someone asks: I am not using cross-validation or bootstrapping to generate the validation set, because the training and validation data are sampled from different domains in this use case (the data are derived from processed DNA sequencing reads, and the genome's species differs between the training and validation/inference tasks).

JMC
  • You don't need to tune the number of trees in a random forest. https://stats.stackexchange.com/questions/348245/do-we-have-to-tune-the-number-of-trees-in-a-random-forest/348246#348246 – Sycorax Feb 01 '21 at 01:02
  • @Sycorax Thanks. In that case, I would still like to weigh the runtime and memory cost of a larger forest against the increased accuracy, and perform the analysis for that reason. – JMC Feb 01 '21 at 01:07
  • That doesn't seem to solve a real problem unless you're able to profit more from spending compute/memory/etc on the larger forest. In my experience, we more often face hard limits, so we have to size the model to fit those hard constraints. In other words, do you realize a profitable return on the marginal cost of each additional tree? If you do, it's worth pricing that out. Otherwise, just use the largest practical model. – Sycorax Feb 01 '21 at 01:12
  • @JMC (I second Sycorax's advice) in that case treat the number of trees as a constraint rather than a variable (e.g. 128 trees) and aim to optimise other hyperparameters. – usεr11852 Feb 01 '21 at 01:12
  • Also, please note that accuracy (or whatever metric) does not increase as we add more trees. It plateaus at some point. Rather anecdotally: usually we see very little gain above 500 trees and no gain above 1000 trees. (Assuming we do not put some crazy constraints on our trees, that is.) – usεr11852 Feb 01 '21 at 01:16
  • @Sycorax In my use case I cannot know the hard limit at this point in time, and since inference complexity is linear in the number of trees, it might become an optimization angle once the other bottlenecks in the data pipeline are known (the classifier will be used "live" on fresh data within a larger piece of software). Also, it's interesting for academic reasons. – JMC Feb 01 '21 at 01:20

1 Answer


No, you do not need to retrain. Because the trees are built independently, removing one tree is (on average) the same as training one fewer tree.
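In scikit-learn you can even get the whole curve from a single growing forest via warm_start, which appends trees to an existing fit instead of retraining from scratch (a minimal sketch; X_train, y_train, X_val, y_val are placeholder names):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    rf = RandomForestClassifier(warm_start=True, random_state=0)
    val_auc = {}
    for n in [1, 2, 5, 10, 25, 50, 100]:
        rf.set_params(n_estimators=n)
        rf.fit(X_train, y_train)  # with warm_start=True this only adds the new trees
        val_auc[n] = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])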

See also Sycorax's link in the comments and this datascience.SE question.

If you really want to compare different numbers of trees, it would probably be more robust to train more than one forest for each number of trees: you'd expect to see decreasing variance in the ensemble as it grows, so examining some confidence intervals might be interesting (see the sketch below).
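Something along these lines (again a sketch with placeholder data names) would give a mean AUC curve with an error band per forest size:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    sizes = [10, 25, 50, 100]
    n_repeats = 10  # independent forests per size, one seed each
    auc = np.empty((len(sizes), n_repeats))
    for i, n in enumerate(sizes):
        for seed in range(n_repeats):
            rf = RandomForestClassifier(n_estimators=n, random_state=seed)
            rf.fit(X_train, y_train)
            auc[i, seed] = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
    # Plot auc.mean(axis=1) against sizes with, e.g., a +/- 2 SD band.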

Ben Reiniger
  • While the trees are independent, the data points of my (number of trees → area under ROC) curve would be dependent in the first case. Does that not make a difference at all? – JMC Feb 01 '21 at 01:35
  • By "on average" do you mean that it doesn't matter if I'm using a large number of sets of trees for each data point but that it might produce a misleading curve when only using one (sub)forest per datapoint? – JMC Feb 01 '21 at 10:32
  • @JMC. Yes, I meant averaging over all the possible additional trees. And you're also right that the subtrees are dependent, and that it might be worthwhile to consider multiple forests with each number of trees; I've added that into the answer. – Ben Reiniger Feb 02 '21 at 14:56