
In the original paper, I was under the impression that the RF couldn't really overfit.

However, in practice I'm seeing that increasing 'ntree' sometimes increases test set error. Is this due to randomness, or is there an optimal level for 'ntree'? Right now, I'm only using cross-validation to choose the optimal 'mtry'.

I am using R and the randomForest package implementation.

John
  • I am having the same problem as John; could anyone comment? In a 10-fold CV (the inner loop of a nested CV), the variance of the cross-validation score (F1) between folds is quite high. I would guess that this makes CV unreliable (?) for model selection... Thanks –  Jul 03 '15 at 07:52

1 Answer


Section 15.3.4 of Elements of Statistical Learning (Hastie et al 2009) (PDF is freely available) discusses this. In short, depending on your point of view, random forest can overfit the data, but not because of ntree.

Hastie et al (2009, page 596) states "it is certainly true that increasing $\mathcal{B}$ [the number of trees] does not cause the random forest sequence to overfit". However, they also state that "the average of fully grown trees can result in too rich a model, and incur unnecessary variance" (op cit.).

In short there may be some overfitting due to the use of fully grown trees (the "average" that they talk about in the second statement), which may be showing up as you add more trees to the forest.
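You can see the first statement empirically by watching the out-of-bag error as trees are added; it typically flattens out rather than rising. A minimal sketch on simulated data (standing in for your own data frame and response):

```r
library(randomForest)

set.seed(1)
# Simulated two-class problem as a placeholder for your data
d <- data.frame(matrix(rnorm(500 * 10), 500, 10))
d$y <- factor(rbinom(500, 1, plogis(d$X1 + d$X2)))

fit <- randomForest(y ~ ., data = d, ntree = 1000)
# err.rate[, "OOB"] holds the cumulative OOB error after each tree is added
plot(fit$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB error")
```

If the curve has flattened, further wiggles in a *test set* error at larger `ntree` are more plausibly noise than overfitting from adding trees.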

Hastie et al (2009) suggest that this "overfitting" rarely costs much in prediction error, and that leaving the tree-depth parameters untuned keeps model tuning simple.

If you want to reassure yourself, you could try tuning over `mtry` plus `nodesize` and/or `maxnodes`; the latter two control the depth of the trees fitted.
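A minimal sketch of that idea, again on simulated placeholder data, scoring a small grid of depth controls by final OOB error (the grid values here are arbitrary illustrations, not recommendations):

```r
library(randomForest)

set.seed(1)
d <- data.frame(matrix(rnorm(500 * 10), 500, 10))
d$y <- factor(rbinom(500, 1, plogis(d$X1 + d$X2)))

# Small grid over the two tree-depth controls
grid <- expand.grid(nodesize = c(1, 5, 10), maxnodes = c(8, 32, 64))
oob <- sapply(seq_len(nrow(grid)), function(i) {
  fit <- randomForest(y ~ ., data = d, ntree = 500,
                      nodesize = grid$nodesize[i],
                      maxnodes = grid$maxnodes[i])
  tail(fit$err.rate[, "OOB"], 1)  # OOB error of the full forest
})
grid[which.min(oob), ]  # depth settings with the lowest OOB error
```

On a small dataset you would want to repeat this (or nest it in CV) before trusting any single winning combination, for exactly the variance reasons discussed in the comments below.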

Sycorax
Gavin Simpson
  • Thanks. A slightly unrelated question but how 'random' are Random Forests? Could my deviation in test/train error from different `ntree` values be simply due to randomness? The dataset I am training on is very small. – John Apr 01 '15 at 20:04
  • Depends how you ran the software. If you set the same random number seed before each call to `randomForest()` then no, a particular tree would choose the same set of `mtry` variables at each node split. The randomness comes from the selection of `mtry` variables with which to form each node. If you run the model several times you may get small differences. These methods aren't really designed for small data sets, and if you are randomly splitting that into training and test sets that could also cause things to vary during repeated runs. – Gavin Simpson Apr 01 '15 at 20:24
  • I'm using repeated cross-validation (k = 10, repeats = (10, 20, ...)) to optimize for `mtry`. It seems that both the CV Error and optimal `mtry` are varying wildly even with 200 repeats. – John Apr 01 '15 at 21:07