Breiman (Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32) states that "In each bootstrap training set, about one-third of the instances are left out". Consider the following example:
library(randomForest)
library(MASS)
data(fgl)
set.seed(17)
table(fgl$type)
model.rf <- randomForest(type ~ ., data = fgl, mtry = 2)
How do I compute the average proportion of out-of-bag (OOB) cases in this analysis, i.e. the quantity that corresponds to Breiman's statement?
My take on it is to compute the average number of OOB cases per tree, i.e. the sum of model.rf$oob.times divided by model.rf$ntree, and then divide that by the number of rows in the data to get the proportion of OOB cases in the whole dataset, hence:
sum(model.rf$oob.times)/model.rf$ntree/nrow(fgl)
This comes to 0.3664019, which is roughly one-third and close to the theoretical out-of-bag fraction (0.3678) given in another thread on CV.
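As a sanity check that does not depend on randomForest, here is a minimal base-R sketch that simulates the bootstrap directly. It assumes n = 214 (the number of rows in fgl) and 500 resamples (the default ntree). The name oob_frac is just an illustrative variable, and the simulation only mimics the row resampling, not the forest itself.

```r
# Base-R sketch: simulate bootstrap samples and measure the OOB fraction.
set.seed(17)
n <- 214      # assumed: number of rows in fgl
ntree <- 500  # assumed: randomForest's default number of trees

# For each simulated tree, draw n row indices with replacement and
# compute the fraction of rows that were never drawn (out-of-bag).
oob_frac <- replicate(ntree, {
  in_bag <- sample.int(n, n, replace = TRUE)
  1 - length(unique(in_bag)) / n
})

mean(oob_frac)   # simulated average OOB proportion
(1 - 1 / n)^n    # expected OOB proportion for a sample of size n
exp(-1)          # limiting value as n grows
```

For n = 214 the expected proportion (1 - 1/n)^n is about 0.367, slightly below the limit exp(-1), so an empirical value near 0.366 seems consistent with both.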
Would this be an accurate answer to my question?