
Breiman (Breiman, L. (2001), "Random Forests", Machine Learning 45(1), 5-32) states that "In each bootstrap training set, about one-third of the instances are left out." Consider the following example:

library(randomForest)
library(MASS)
data(fgl)
set.seed(17)
table(fgl$type)
model.rf <- randomForest(type ~ ., data = fgl, mtry = 2)

how do I compute the average proportion of out-of-bag cases in this analysis, so that it can be compared with Breiman's statement?

My take on it is to compute the average number of OOB cases per tree, i.e. model.rf$oob.times summed and divided by ntree, and then relate that number to the size of the data (the number of rows/records) to get the proportion of OOB cases in the whole dataset, hence:

sum(model.rf$oob.times)/model.rf$ntree/nrow(fgl)

This comes out to 0.3664019, which is roughly one third and close to the theoretical out-of-bag proportion (0.3678) stated in another thread on CV.
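For comparison, the theoretical exclusion probability for a dataset of this size can be computed directly. A minimal sketch (not part of the original question; it assumes fgl has 214 rows, as in MASS):

```r
# Probability that a given observation is left out of a single
# bootstrap sample of size n: each of the n draws misses it with
# probability (1 - 1/n), so the exclusion probability is (1 - 1/n)^n.
n <- 214          # nrow(fgl) for the forensic glass data
p_oob <- (1 - 1/n)^n
p_oob             # close to the empirical 0.3664 above
```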

Would this be an accurate answer to my question?

Patrik
  • All three numbers you cite are approximations to $1/e\approx 0.367879$, which is the limiting proportion for large datasets. – whuber Jan 03 '18 at 13:29
  • Could you elaborate Dr. Huber, what is meant by "limiting proportion for large datasets"? – Patrik Jan 03 '18 at 13:55
  •
    The expected number of instances left out is always a rational number, and therefore never exactly equal to $1/e$, but as the dataset size increases, it approaches $1/e$. A good search term is [derangement](https://stats.stackexchange.com/search?q=derangement). – whuber Jan 03 '18 at 13:58
  •
    Thank you, I understand this RF OOB rule now better. This also speaks in favor of my personal understanding of the issue. – Patrik Jan 03 '18 at 14:01
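The convergence whuber describes can be checked numerically; a short sketch (not part of the thread):

```r
# (1 - 1/n)^n, the probability of a given observation being left out
# of a bootstrap sample of size n, increases toward exp(-1) as n grows.
for (n in c(10, 100, 1000, 10000)) {
  cat(sprintf("n = %5d: %.6f\n", n, (1 - 1/n)^n))
}
exp(-1)  # the limit, approximately 0.367879
```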
