
Using the randomForest package in R, I am getting 100% accuracy on the training dataset. Here is a reproducible example:

library(randomForest)

#### generate dataset  ####
set.seed(123) # fix the RNG so the example is actually reproducible; the seed value is arbitrary
n.obs <- 10000

# two predictors
x <- matrix(NA, ncol=2, nrow=n.obs)
x[,1] <- rnorm(n.obs)
x[,2] <- rnorm(n.obs)

# y is binary. It depends on both predictors, but contains noise
y <- as.factor(1*((x[,1]+x[,2]+rnorm(n.obs))>0)) 

# split the dataset in two halves
split <- round(n.obs/2, 0)
x.train <- x[1:split,]
x.test <- x[(split+1):n.obs,]
y.train <- y[1:split]
y.test <- y[(split+1):n.obs]

#### train the forest   #### 
fit <- randomForest(x=x.train,
                y=y.train,
                ntree=1000,
                mtry=2,
                keep.forest=TRUE)

#### Predict both sets  ####
predictions.train <- predict(fit, newdata = x.train)
predictions.test <- predict(fit, newdata = x.test)

####  Compute accuracy on both sets ####
sum(predictions.train == y.train)/length(y.train) # 100% accuracy
sum(predictions.test == y.test)/length(y.test) # ~77% accuracy

Note that I am not interested in the OOB error. In Breiman's original paper, it is mentioned that overfitting is not an issue with this algorithm. What am I missing?

Pierre Cattin
  • What are you expecting... care to elaborate? – zacdav Aug 07 '18 at 12:24
  • In Breiman's original paper, it is mentioned that overfitting is not an issue with this algorithm. Here we clearly see an overfit, as the accuracy is 100% on a noisy dataset. Accuracy on the testing set is only 77%. – Pierre Cattin Aug 07 '18 at 12:29
  • The randomForest model you're building can split each tree up to the maximum possible number of nodes. This could mean one terminal node per observation, which will lead to 100% accuracy on the train set. This happens because by default `maxnodes = NULL`. Try to experiment with `maxnodes = 5, 20, 30, 40, ...` and see what happens (the sketch after these comments tries this). – AntoniosK Aug 07 '18 at 12:40
  • Also, maybe the way the paper explains it is not very clear, but what they wanted to say is that "the testing performance of this model doesn't decrease even if the performance on the training set tends to 100%". Overfitting is not (just) reaching close to 100% on your train set, but when, BECAUSE of reaching that 100% on the train set, your test-set predictions become worse than they could be. – AntoniosK Aug 07 '18 at 12:46
  • I see your point. A single tree will fully fit the dataset used to grow it. However, if I understand the implementation correctly, none of the trees sees the full dataset. Random forests use bagging, so the sample used for each tree is bootstrapped. In these conditions, no tree can possibly fully fit the dataset, so I doubt that the ensemble could produce a perfect fit. – Pierre Cattin Aug 07 '18 at 12:52
  • `maxnodes` (i.e. the maximum number of terminal nodes in each tree) is not the only hyperparameter that produces that 100% accuracy on the train set. Even if you keep your exact code, change to `ntree=10`, and allow for max depth, you won't reach 100%, because, as you said, the model never sees the whole train set. Only after around `ntree=100` do I start getting 100% accuracy, which means that the model has seen the full train set (over all those 100 trees). – AntoniosK Aug 07 '18 at 13:05
  • Yes you're right, good point. I guess it's just my intuition that's wrong. I thought that the fact that each tree sees two thirds of the dataset on average would cause the decision boundary of the forest to be smoother, but apparently that's not the case. – Pierre Cattin Aug 07 '18 at 13:11
  • I'd also add a point regarding the meaning of "overfitting". The fact that accuracy is 100% on the training set, while on the test set it is smaller, doesn't necessarily imply that you are overfitting. In my view, a model overfits when, as the train accuracy increases, the test accuracy decreases. With RF that's seldom the case. An increase in train accuracy in RF very often implies better test performance. That's not the case with higher-capacity models. – nicola Aug 07 '18 at 13:18
  • Well in the example above, RF produces a highly non-linear decision boundary, while the true data-generating process is linear. A linear decision boundary would perform better in this case, so RF is clearly overfitting! (The sketch after these comments benchmarks a logistic regression for comparison.) – Pierre Cattin Aug 07 '18 at 13:26
  • @nicola is right. It's what I mentioned above: "The testing performance of this model doesn't decrease even if the performance on the training set tends to 100%". The correct (and more general) statement is "...performance on the training set increases". I just used "tends to 100%" to go along with what we've observed here. – AntoniosK Aug 07 '18 at 13:30
  • I recommend reading https://stats.stackexchange.com/questions/348245/do-we-have-to-tune-the-number-of-trees-in-a-random-forest/348246#348246 which discusses the relationship between accuracy and the number of trees in a random forest. – Sycorax Aug 13 '18 at 14:18
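
A minimal sketch of the experiments suggested in the comments above (the `acc` helper and the data-frame names are mine, and exact numbers will vary from run to run): capping `maxnodes`, shrinking `ntree`, and benchmarking against a plain logistic regression, which matches the linear data-generating process.

# reuses x.train, y.train, x.test, y.test from the question
acc <- function(pred, truth) sum(pred == truth)/length(truth)

# 1. cap the tree size: training accuracy drops below 100%
fit.small <- randomForest(x=x.train, y=y.train, ntree=1000, mtry=2, maxnodes=20)
acc(predict(fit.small, newdata=x.train), y.train)
acc(predict(fit.small, newdata=x.test), y.test)

# 2. few trees: the ensemble has not yet memorized every training point
fit.few <- randomForest(x=x.train, y=y.train, ntree=10, mtry=2)
acc(predict(fit.few, newdata=x.train), y.train)

# 3. logistic regression matches the linear data-generating process,
# so its test accuracy should be at least on par with the forest's
df.train <- data.frame(x1=x.train[,1], x2=x.train[,2], y=y.train)
df.test <- data.frame(x1=x.test[,1], x2=x.test[,2])
fit.glm <- glm(y ~ x1 + x2, data=df.train, family=binomial)
acc(as.factor(1*(predict(fit.glm, newdata=df.test, type="response") > 0.5)), y.test)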

1 Answer


100% accuracy does not have to mean overfitting. There is nothing "wrong" with a model that always returns correct predictions. The problem is when the model performs markedly better on the training set than on the test set, because that means it didn't generalize anything from the data: it just memorized the training examples, so it can make correct predictions only in cases identical to the ones it saw in the training set. Such a model would not be able to extrapolate beyond what it has seen. But in a perfect world, we would expect to see 100% accuracy on both the training and the test set; why should we aim for imperfect predictions?
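
A minimal sketch of the memorization point (not part of the original answer; it reuses the question's data and the `class` package that ships with R): a 1-nearest-neighbour classifier is pure memorization, since every training point is its own nearest neighbour, so it also scores 100% on the training set but generalizes worse than the forest.

library(class) # for knn()

# 1-NN "memorizes" the training set: each point is its own nearest neighbour
pred.train <- knn(train=x.train, test=x.train, cl=y.train, k=1)
pred.test <- knn(train=x.train, test=x.test, cl=y.train, k=1)

sum(pred.train == y.train)/length(y.train) # 100%, by construction
sum(pred.test == y.test)/length(y.test)    # typically well below the forest's test accuracy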

Tim