Using the randomForest package in R, I am getting 100% accuracy on the training dataset. Here is a reproducible example:
library(randomForest)
set.seed(42) # fix the RNG so the example is fully reproducible
#### generate dataset ####
n.obs <- 10000
# two predictors
x <- matrix(NA, ncol=2, nrow=n.obs)
x[,1] <- rnorm(n.obs)
x[,2] <- rnorm(n.obs)
# y is binary. It depends on both predictors, but contains noise
y <- as.factor(1*((x[,1]+x[,2]+rnorm(n.obs))>0))
# split the dataset in two halves
split <- round(n.obs/2, 0)
x.train <- x[1:split,]
x.test <- x[(split+1):n.obs,]
y.train <- y[1:split]
y.test <- y[(split+1):n.obs]
#### train the forest ####
fit <- randomForest(x = x.train,
                    y = y.train,
                    ntree = 1000,
                    mtry = 2,
                    keep.forest = TRUE)
#### Predict both sets ####
predictions.train <- predict(fit, newdata = x.train)
predictions.test <- predict(fit, newdata = x.test)
#### Compute accuracy on both sets ####
sum(predictions.train == y.train)/length(y.train) # 100% accuracy
sum(predictions.test == y.test)/length(y.test) # ~77% accuracy
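To make the gap concrete, here is a minimal sketch of the same setup (smaller n so it runs quickly) contrasting the two prediction modes of randomForest: passing the training data through `newdata` gives resubstitution predictions, while calling `predict()` with no `newdata` returns the out-of-bag predictions, which land close to the test-set accuracy:

```r
library(randomForest)  # assumes the randomForest package is installed
set.seed(1)

# small version of the same data-generating process
n <- 1000
x <- matrix(rnorm(2 * n), ncol = 2)
y <- as.factor(1 * ((x[, 1] + x[, 2] + rnorm(n)) > 0))

fit <- randomForest(x = x, y = y, ntree = 500, keep.forest = TRUE)

# Resubstitution: every tree scores observations most of them were grown on
resub <- predict(fit, newdata = x)

# OOB: each observation is scored only by trees that never saw it
oob <- predict(fit)

mean(resub == y)  # near 1
mean(oob == y)    # close to the honest test-set accuracy
```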
Note that I am not interested in the OOB error itself. In Breiman's original paper, it is mentioned that overfitting is not an issue with this algorithm. What am I missing?