I have a training set and a validation set, and I wish to optimize the number of trees (a hyperparameter) in a binary classification random forest (scikit-learn). (As Sycorax explained in the comments, accuracy only increases with forest size, but I would still like to compare accuracy across sizes to trade off runtime against accuracy.)
I want to plot the area under the ROC curve (AUC) on the training and validation sets for each number of trees considered, to see how the classifier improves as the forest grows.
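To make the evaluation concrete, this is roughly the helper I would call once per forest size (a minimal sketch; `X_train`, `y_train`, `X_val`, `y_val` are placeholder names for my data):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def train_val_auc(n_trees, X_train, y_train, X_val, y_val, seed=0):
    """Fit a forest with n_trees trees and return (train AUC, validation AUC)."""
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    rf.fit(X_train, y_train)
    # Score with the predicted probability of the positive class.
    auc_train = roc_auc_score(y_train, rf.predict_proba(X_train)[:, 1])
    auc_val = roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1])
    return auc_train, auc_val
```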
Does it make a difference whether I:
- Train one complete forest of 100 trees, classify the validation samples with the full forest, and then repeat the classification after removing one random tree from the forest at each step, or
- Retrain a new forest from scratch for every number of trees considered?
The first method is obviously much faster, but does it somehow introduce errors or biases, or otherwise invalidate the analysis?
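To make the first method concrete, this is roughly what I have in mind (a sketch only; the data names are placeholders, and the manual averaging of per-tree probabilities is meant to mirror what the forest's own `predict_proba` does):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Train the full 100-tree forest once.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

def subset_auc(trees, X, y):
    """AUC using only the given subset of fitted trees."""
    # A forest's probability estimate is the mean of its trees' probabilities,
    # so averaging over a subset scores a "smaller" forest without refitting.
    proba = np.mean([tree.predict_proba(X)[:, 1] for tree in trees], axis=0)
    return roc_auc_score(y, proba)

rng = np.random.default_rng(0)
trees = list(rf.estimators_)
auc_by_size = {}
while trees:
    auc_by_size[len(trees)] = subset_auc(trees, X_val, y_val)
    # Remove one random tree and repeat, as in the first method.
    trees.pop(rng.integers(len(trees)))
```

The second method would instead call something like `train_val_auc` above once for each forest size.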
In case someone asks: I am not using cross-validation or bootstrapping to generate the validation set, because the training and validation sets come from different domains in this use case (the data are derived from processing DNA sequencing reads, and the genome's species differs between the training and validation/inference tasks).