17

I am fairly new to random forests. In the past, I have always compared the accuracy of fit vs test against fit vs train to detect any overfitting. But I just read here that:

"In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally , during the run..."

The small paragraph above can be found under *The out-of-bag (oob) error estimate* section. This out-of-bag error concept is completely new to me, and what's a little confusing is that the OOB error in my model is 35% (or 65% accuracy), yet if I apply cross-validation to my data (just a simple holdout method) and compare fit vs test against fit vs train, I get 65% accuracy and 96% accuracy respectively. In my experience that would be considered overfitting, but the OOB shows a 35% error, just like my fit vs test error. Am I overfitting? Should I even be using cross-validation to check for overfitting in random forests?

In short, I am not sure whether I should trust the OOB error as an unbiased estimate of the test set error when my fit vs train accuracy suggests that I am overfitting!
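Roughly what I am doing, as a sketch (with the `randomForest` package; `mydata` and `response` are placeholders for my actual data frame and outcome column):

```r
library(randomForest)

set.seed(1)
idx   <- sample(nrow(mydata), floor(0.7 * nrow(mydata)))  # simple holdout split
train <- mydata[idx, ]
test  <- mydata[-idx, ]

rf <- randomForest(response ~ ., data = train)

# "fit vs train": predict the training rows with the fitted forest
mean(predict(rf, newdata = train) == train$response)   # ~96% accuracy for me

# "fit vs test": predict the held-out rows
mean(predict(rf, newdata = test) == test$response)     # ~65% accuracy for me

# OOB error reported by the forest itself
rf$err.rate[rf$ntree, "OOB"]                            # ~35% for me
```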

jgozal
  • 819
  • 2
  • 9
  • 15
  • OOB can be used for determining hyper-parameters. Other than that, to estimate the performance of a model, one should use cross-validation. – Metariat Apr 17 '16 at 16:03
  • @Matemattica when you talk about hyper-parameters what exactly are you talking about? Sorry for my lack of knowledge in the topic – jgozal Apr 17 '16 at 16:04
  • the number of trees and the number of features randomly selected at each iteration – Metariat Apr 17 '16 at 16:14
  • I know this is a completely different question, but how do you determine the number of trees and the sample of features at each iteration from an error? – jgozal Apr 17 '16 at 16:15
  • I think I was able to figure that one out. Still a little confused about my original question – jgozal Apr 17 '16 at 16:22
  • Can you explain what you mean by **fit vs test** against **fit vs train**? Is it the accuracy (error) on the testing/training set? – Metariat Apr 17 '16 at 16:26
  • @Matemattica the accuracy of my predicted testing vector against my test response vector compared to the accuracy of my predicted training vector to my train response vector – jgozal Apr 17 '16 at 16:28
  • 1
    Maybe this could help: http://stats.stackexchange.com/a/112052/78313 In general I've never seen such a difference in RF! – Metariat Apr 17 '16 at 16:34
  • I have - check this post: http://stats.stackexchange.com/questions/169357/random-forest-overfitting-r. Let me check yours – jgozal Apr 17 '16 at 16:37
  • Oh wow what! I was doing `predict(model, data=train)`! – jgozal Apr 17 '16 at 16:39
  • "The first option gets the out-of-bag predictions from the random forest. This is generally what you want, when comparing predicted values to actuals on the training data." Ok so I am super confused now. Is the OOB then pretty much the error of fit vs train response? – jgozal Apr 17 '16 at 16:40
  • I am reading more on the topic on a follow up question here: http://stats.stackexchange.com/questions/162353/what-measure-of-training-error-to-report-for-random-forests – jgozal Apr 17 '16 at 16:46
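In code, the distinction from the linked answer, as I understand it (a sketch; `rf` and `train` as in the snippet above):

```r
# Option 1: no newdata -- returns the out-of-bag predictions, i.e. each
# training row is predicted only by the trees that never saw it.
oob_pred   <- predict(rf)                    # ~65% accuracy in my case

# Option 2: newdata = train -- every tree votes on every training row,
# so the accuracy is near-perfect almost by construction.
refit_pred <- predict(rf, newdata = train)   # ~96% accuracy in my case

mean(oob_pred   == train$response)
mean(refit_pred == train$response)
```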

2 Answers

22
  • training error (as in `predict(model, data=train)`) is typically useless. Unless you do (non-standard) pruning of the trees, it cannot be much above 0 by design of the algorithm. Random forest uses bootstrap aggregation of decision trees, which are known to overfit badly. This is like the training error of a 1-nearest-neighbour classifier.

  • However, the algorithm offers a very elegant way of computing the out-of-bag error estimate (which is essentially an out-of-bootstrap estimate of the aggregated model's error). The out-of-bag error is the estimated error for aggregating the predictions of the $\approx \frac{1}{e}$ fraction of the trees that were trained without that particular case (see the sketch after this list).
    The models aggregated for the out-of-bag error will only be independent if there is no dependence between the input data rows, i.e. each row is one independent case: no hierarchical data structure, no clustering, no repeated measurements.

    So the out-of-bag error is not exactly the same as a cross-validation error (fewer trees in the aggregation, more copies of the training cases), but for practical purposes it is close enough.

  • What would make sense for detecting overfitting is to compare the out-of-bag error with an external validation. However, unless you know about clustering in your data, a "simple" cross-validation error will be prone to the same optimistic bias as the out-of-bag error: the splitting is done according to very similar principles.
    You'd need to compare the out-of-bag or cross-validation error with the error on a well-designed test experiment to detect this.
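A concrete sketch of how that out-of-bag aggregation works, using the `randomForest` package's `keep.inbag` option on the built-in `iris` data (the manual reconstruction below may differ from the forest's own OOB predictions where votes are tied):

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, keep.inbag = TRUE)

# Each row is out of bag for roughly a 1/e fraction of the trees:
mean(rf$inbag == 0)                                  # close to exp(-1) = 0.368

# Per-tree predictions for every training row:
tree_pred <- predict(rf, newdata = iris, predict.all = TRUE)$individual

# For each row, aggregate only the trees that did NOT have it in their bootstrap:
oob_vote <- sapply(seq_len(nrow(iris)), function(i) {
  votes <- tree_pred[i, rf$inbag[i, ] == 0]
  names(which.max(table(votes)))
})

# This reproduces the forest's own OOB predictions (up to tie-breaking) ...
mean(oob_vote == as.character(rf$predicted))

# ... and hence, approximately, the reported OOB error:
mean(oob_vote != as.character(iris$Species))
rf$err.rate[rf$ntree, "OOB"]
```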

cbeleites unhappy with SX
  • 34,156
  • 3
  • 67
  • 133
  • Could you please add citations for these claims? They are very interesting and relevant for me and I would like to read more. – Angus Campbell Sep 22 '21 at 02:03
12

Out-of-bag error is useful, and may replace other performance estimation protocols (like cross-validation), but should be used with care.

Like cross-validation, the out-of-bag estimate is computed on data that were not used for learning. But if the data have been processed in a way that transfers information across samples, the estimate will (probably) be biased. Simple examples that come to mind are feature selection and missing-value imputation. In both cases (and especially for feature selection) the data are transformed using information from the whole data set, which biases the estimate.
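For the feature-selection case, a sketch on pure noise (with the `randomForest` package; the sizes are arbitrary), where any apparent skill in the OOB estimate is selection bias:

```r
library(randomForest)

set.seed(1)
n <- 50; p <- 5000
X <- matrix(rnorm(n * p), n, p)                 # pure noise features
y <- factor(rep(c("a", "b"), length.out = n))   # labels unrelated to X

# Leaky step: rank features by correlation with the outcome using ALL rows,
# i.e. using information from every sample, including each tree's OOB cases.
scores <- apply(X, 2, function(x) abs(cor(x, as.numeric(y))))
keep   <- order(scores, decreasing = TRUE)[1:25]

rf_leaky <- randomForest(x = X[, keep], y = y)
rf_leaky$err.rate[rf_leaky$ntree, "OOB"]        # typically far below the 50% chance level

# Without the selection step, the OOB error hovers around 50%, as it should:
rf_all <- randomForest(x = X, y = y)
rf_all$err.rate[rf_all$ntree, "OOB"]
```

The OOB trees still use features that were picked by looking at their own labels, so the estimate comes out much better than the chance-level error a truly uninformative model should get.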

George
  • 419
  • 3
  • 8