This is a short simulation to check the coverage of the random forest confidence intervals introduced in the following paper when they are used as prediction intervals.
S. Wager, T. Hastie and B. Efron. Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research, 15, pp. 1625-1651 (2014).
The data is simulated from the process described in Section 4.3 of
J. Friedman. Multivariate Adaptive Regression Splines. Annals of Statistics, 19(1), pp. 1-67 (1991).
There are six independent predictors $X_1,\dots,X_6$, each with a $\text{U}[0,1]$ distribution, and an independent noise term $\epsilon$ with a $\text{N}(0,1)$ distribution. The response variable is defined as $$ Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 - 1/2)^2 + 10 X_4 + 5 X_5 + \epsilon. $$
friedman <- function(n, p = 6) {
  # p independent U[0, 1] predictors; x_6 does not enter the response
  X <- matrix(runif(n * p), nrow = n, ncol = p)
  y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
    10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  # as.data.frame() silently ignores a col.names argument, so set the names explicitly
  out <- as.data.frame(cbind(y, X))
  names(out) <- c("y", paste0("x_", 1:p))
  out
}
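A quick, purely illustrative sanity check of the simulator:
str(friedman(5))   # 5 rows: the response y and the predictors x_1, ..., x_6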
We generate a training sample of size 1,000 and a test sample of size 100,000.
set.seed(42)
n <- 10^3        # training sample size
training <- friedman(n)
n_tst <- 10^5    # test sample size
test <- friedman(n_tst)
library(ranger)

# keep.inbag = TRUE is needed to compute the infinitesimal jackknife standard errors
rf <- ranger(y ~ ., data = training, num.trees = 10^3, keep.inbag = TRUE)
pred <- predict(rf, data = test, type = "se", se.method = "infjack")

y_hat <- pred$predictions
Lower <- y_hat - 1.96 * pred$se
Upper <- y_hat + 1.96 * pred$se

# proportion of test responses falling inside the nominal 95% intervals
mean((Lower <= test$y) & (test$y <= Upper))
We set a nominal coverage of 95%, but the simulation yields an effective coverage of approximately 20.2%.
The question is: why is the effective predictive coverage obtained in this simulation so low?
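To put the 20.2% figure in context, the estimated standard errors can be compared with the empirical spread of the prediction errors on the test set. The following is a minimal diagnostic sketch, reusing only the pred and test objects defined above:

mean(pred$se)                    # the intervals have half-width 1.96 times this, on average
sd(test$y - pred$predictions)    # empirical standard deviation of the prediction errors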
Notes:
We compute the $1-\alpha$ level confidence intervals as described in the footnote on page 3 of Wager et al.: $\hat{y} \pm z_\alpha \hat{\sigma}$.
As pointed out by usεr11852 in the comments below, the effective coverage decreases as we increase the number of trees in the forest (a sketch of such a loop is given after the list). Examples:
num.trees = 50 => effective coverage = 0.94855
num.trees = 100 => effective coverage = 0.76876
num.trees = 150 => effective coverage = 0.68959
num.trees = 200 => effective coverage = 0.56038
num.trees = 250 => effective coverage = 0.55393
num.trees = 300 => effective coverage = 0.32304
num.trees = 350 => effective coverage = 0.55413
num.trees = 400 => effective coverage = 0.26372
num.trees = 450 => effective coverage = 0.26232
num.trees = 500 => effective coverage = 0.23139
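For reference, here is a sketch of the kind of loop used to obtain these coverages; coverage_for is just a hypothetical helper name, and it simply repeats the fit/predict/coverage steps above for each forest size, reusing the training and test sets already generated:

# effective coverage of the nominal 95% intervals for a given number of trees
coverage_for <- function(num_trees) {
  rf <- ranger(y ~ ., data = training, num.trees = num_trees, keep.inbag = TRUE)
  pred <- predict(rf, data = test, type = "se", se.method = "infjack")
  lower <- pred$predictions - 1.96 * pred$se
  upper <- pred$predictions + 1.96 * pred$se
  mean((lower <= test$y) & (test$y <= upper))
}
sapply(seq(50, 500, by = 50), coverage_for)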