For linear regression it is possible to place confidence intervals around the R-squared, either by formula or by bootstrapping. Random Forest models, when used for regression, return the "explained variation" or, better, the "captured variation". However, I have not found a method to estimate confidence intervals around this single value. I am not looking for prediction intervals, but for confidence intervals on the pseudo-R-squared. Is it possible to estimate such an interval for an RF model? Would it be possible to correlate the predicted values of each tree with the observed response in the dataset and extract the 2.5% and 97.5% quantiles? See the example below.

library(randomForest)
library(pdp)

set.seed(1)
data <- data.frame(y=1:1000, 
               x=c(1:500+rnorm(500, 2,100), rnorm(500,-300, 100)))

mod <- randomForest(y~x, data=data, nodesize=200, keep.inbag=TRUE)

#plot results
pdplot <- pdp::partial(mod, pred.var = "x", grid.resolution=30)
plot(data$x, data$y)
lines(pdplot$x, pdplot$yhat, col="red", lwd=2)

#extract trees
predtree <- as.data.frame(predict(mod, data, predict.all=TRUE)[["individual"]])

#Squared correlation of each tree's predictions with the observed response
predrsq <- sapply(predtree, function(p) cor(data$y, p)^2)

#Captured variation of the model.
mod$rsq[length(mod$rsq)]

#2.5 and 97.5% for all 500 trees
quantile(predrsq, c(.025, .975))

One issue is that with the default node size, the 2.5% and 97.5% quantiles are placed higher than the captured variation of the model (overfitting?). Thank you in advance.
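A note on that last observation: `mod$rsq` is computed from out-of-bag (OOB) predictions, whereas `predict(mod, data, predict.all=TRUE)` scores every tree on all rows, including the rows that tree was grown on. That mismatch alone can push the per-tree values above the model's figure. Below is a minimal sketch, assuming the same simulated data, that scores each tree only on its own OOB rows and uses 1 - SSE/SST rather than a squared correlation. The names `pred_all`, `oob`, and `oob_rsq` are mine and not part of randomForest.

```r
library(randomForest)

set.seed(1)
data <- data.frame(y = 1:1000,
                   x = c(1:500 + rnorm(500, 2, 100), rnorm(500, -300, 100)))

mod <- randomForest(y ~ x, data = data, nodesize = 200, keep.inbag = TRUE)

# Per-tree predictions for every row: an n x ntree matrix
pred_all <- predict(mod, data, predict.all = TRUE)$individual

# mod$inbag counts how often each row was drawn for each tree;
# a count of 0 means the row was out-of-bag for that tree
oob <- mod$inbag == 0

# Pseudo-R-squared of each tree, evaluated on its own OOB rows only
oob_rsq <- sapply(seq_len(ncol(pred_all)), function(t) {
  idx <- oob[, t]
  sse <- sum((data$y[idx] - pred_all[idx, t])^2)
  sst <- sum((data$y[idx] - mean(data$y[idx]))^2)
  1 - sse / sst
})

# Tree-to-tree spread of the OOB pseudo-R-squared
quantile(oob_rsq, c(.025, .975))
```

These quantiles describe how much individual trees vary. A single tree is usually weaker than the averaged forest, so I would not read them as a confidence interval for the forest's pseudo-R-squared.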

A4-paper

  • Bootstrap was my first thought, though I am wondering how that plays with the bootstrapping within the RF model. – Dave Dec 02 '20 at 12:15
  • Then I would be bootstrapping the bootstrap. It's possible, but relatively time-consuming, considering I would go for 1000 trees and 1000 bootstraps. – A4-paper Dec 02 '20 at 12:49
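The "bootstrapping the bootstrap" idea from the comments could be sketched roughly as follows. This is an illustration under my own assumptions, not a validated method: each outer resample refits the forest, and the pseudo-R-squared is computed on the rows left out of that resample, because duplicated rows inside a resample could end up both in-bag and out-of-bag within the inner forest. `B`, `n_tree`, and `boot_rsq` are hypothetical names, and the small values are chosen only to keep the run time short.

```r
library(randomForest)

set.seed(1)
data <- data.frame(y = 1:1000,
                   x = c(1:500 + rnorm(500, 2, 100), rnorm(500, -300, 100)))

B      <- 50   # outer bootstrap replicates (assumed; increase for real use)
n_tree <- 200  # trees per forest (assumed)

boot_rsq <- replicate(B, {
  # Outer resample of the rows, with replacement
  idx <- sample(nrow(data), replace = TRUE)
  held_out <- data[-unique(idx), ]

  fit <- randomForest(y ~ x, data = data[idx, ],
                      nodesize = 200, ntree = n_tree)

  # Pseudo-R-squared on rows that never entered this resample
  pred <- predict(fit, held_out)
  1 - sum((held_out$y - pred)^2) /
      sum((held_out$y - mean(held_out$y))^2)
})

# Percentile interval over the outer replicates
quantile(boot_rsq, c(.025, .975))
```

The cost grows with `B * n_tree` tree fits, which matches the run-time concern raised in the comment above.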
