
This is a short simulation to check the coverage of the random forest confidence intervals introduced in the following paper, when used as predictive intervals.

S. Wager, T. Hastie and B. Efron. Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. Journal of Machine Learning Research, 15, pp. 1625–1651 (2014).

The data is simulated from the process described in Section 4.3 of

J. Friedman. Multivariate Adaptive Regression Splines. Annals of Statistics, 19(1), pp. 1–67 (1991).

There are six independent predictors $X_1,\dots,X_6$, each uniformly distributed on $[0,1]$, and an independent random variable $\epsilon$ with a $\text{N}(0,1)$ distribution. The response variable is defined as $$ Y = 10 \sin (\pi X_1 X_2) + 20 (X_3 - 1/2)^2 + 10 X_4 + 5 X_5 + \epsilon, $$ so $X_6$ is a pure noise predictor that does not enter the mean function.

friedman <- function(n, p = 6) {
    # Simulate from Friedman (1991), Section 4.3; x_6 is pure noise
    X <- matrix(runif(n * p), nrow = n, ncol = p)
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
        10 * X[, 4] + 5 * X[, 5] + rnorm(n)
    # as.data.frame() silently ignores a col.names argument, so set names explicitly
    setNames(as.data.frame(cbind(y, X)), c("y", paste0("x_", 1:p)))
}

We generate a training sample of size 1,000 and a test sample of size 100,000.

set.seed(42)

n <- 10^3
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

library(ranger)

# keep.inbag = TRUE is required for the jackknife-based standard errors
rf <- ranger(y ~ ., data = training, num.trees = 10^3, keep.inbag = TRUE)

pred <- predict(rf, data = test, type = "se", se.method = "infjack")

y_hat <- pred$predictions

# Nominal 95% intervals: y_hat +/- 1.96 * estimated standard error
Lower <- y_hat - 1.96 * pred$se
Upper <- y_hat + 1.96 * pred$se

# Effective coverage on the test sample
mean((Lower <= test$y) & (test$y <= Upper))

We set a nominal coverage of 95%, but the simulation yields an effective coverage of approximately 20.2%.

The question is: why is the effective predictive coverage obtained in this simulation so low?

Notes:

We're computing the $1-\alpha$ level confidence intervals as described in their footnote (page 3): $\hat{y}\pm z_\alpha \hat{\sigma}$.

As pointed out by usεr11852 in the comments below, the effective coverage decreases as we increase the number of trees in the forest. Examples:

num.trees =  50 => effective coverage = 0.94855
num.trees = 100 => effective coverage = 0.76876 
num.trees = 150 => effective coverage = 0.68959 
num.trees = 200 => effective coverage = 0.56038 
num.trees = 250 => effective coverage = 0.55393 
num.trees = 300 => effective coverage = 0.32304 
num.trees = 350 => effective coverage = 0.55413 
num.trees = 400 => effective coverage = 0.26372 
num.trees = 450 => effective coverage = 0.26232 
num.trees = 500 => effective coverage = 0.23139 
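
A minimal sketch of the sweep behind these numbers (assuming the same `training` and `test` sets and the same interval construction as above; exact values will vary with the seed):

for (B in seq(50, 500, by = 50)) {
    rf_B <- ranger(y ~ ., data = training, num.trees = B, keep.inbag = TRUE)
    p_B <- predict(rf_B, data = test, type = "se", se.method = "infjack")
    covered <- (p_B$predictions - 1.96 * p_B$se <= test$y) &
        (test$y <= p_B$predictions + 1.96 * p_B$se)
    cat("num.trees =", B, "=> effective coverage =", mean(covered), "\n")
}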
Zen
  • I am looking at the IJ-related unit tests within `ranger` [here](https://github.com/imbs-hl/ranger/blob/master/tests/testthat/test_jackknife.R), but it doesn't seem to be extensively tested as far as coverage goes. The standard jackknife (`se.method = "jack"`) seems slightly better (~44%), but yeah, this looks fishy; maybe a GitHub issue? I suspect it has something to do with your `mtry` choice too. I seem to get vastly different coverage depending on the choice. (In this example the value is 2.) – usεr11852 Jan 27 '22 at 16:05
  • I will be bearish and say that the methodology doesn't work that great. We can get 94.7% coverage if we use `num.trees = 50, mtry = 6`, so most likely it has to do with the variability of the estimator itself. – usεr11852 Jan 27 '22 at 16:18
  • It is better to model L and U by quantile regression, using RF or GB. – Davi Américo Jan 29 '22 at 21:12

2 Answers


The way the OP is written and the results are evaluated, this question appears to confuse a prediction interval with a confidence interval.

The prediction interval gives an interval estimate of a random variable, while a confidence interval provides an interval estimate of a parameter. The difference between the two is conceptually as large as the difference between the standard deviation of a variable and the standard error of its mean.
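
A small standalone illustration of the difference: the standard error of the mean shrinks as $n$ grows, while the standard deviation of the variable, which is what a prediction interval must cover, does not.

set.seed(1)
for (n in c(10^2, 10^4, 10^6)) {
    x <- rnorm(n)
    cat("n =", n, "| sd(x) =", round(sd(x), 3),
        "| se of mean =", signif(sd(x) / sqrt(n), 3), "\n")
}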

EDIT (closing logical gap): The confusion often arises in relation to model predictions because they can be viewed both as a guess for a single random variable and as an estimate of a conditional mean. END EDIT.

Jackknife estimates provide uncertainties for an estimated parameter, so the method you are trying gives a confidence interval for the conditional mean, while your OP asks for a prediction interval.

This would somewhat explain why more trees (hence more stable estimates of the conditional mean) lead to shorter intervals.

Two approaches that I can think of:

Quantile random forest

I wanted to give you an example of how to use quantile random forest to produce (conceptually slightly too narrow) prediction intervals, but instead of getting 80% coverage, I end up with 90% coverage; see also @Andy W's answer and @Zen's comment. Something similar happens with different parametrizations. I cleaned up the code a little bit, so here it is anyway.

library(ranger)

friedman <- function(n) {
  X <- matrix(runif(n * 6), nrow = n)
  data.frame(
    X,
    y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
        10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  )
}

set.seed(42)

n <- 10^5
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

# quantreg = TRUE stores the information needed for quantile prediction
rf <- ranger(y ~ ., data = training, quantreg = TRUE)

# Nominal 80% interval from the 10% and 90% conditional quantiles
pred <- predict(rf, data = test, quantiles = c(0.1, 0.9), type = "quantiles")

y_hat <- pred$predictions

Lower <- y_hat[, 1]
Upper <- y_hat[, 2]

mean((Lower <= test$y) & (test$y <= Upper))  # 0.91616
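
For a nominal 95% interval one would request the 2.5% and 97.5% quantiles from the same fitted forest, e.g.:

pred95 <- predict(rf, data = test, quantiles = c(0.025, 0.975), type = "quantiles")
mean((pred95$predictions[, 1] <= test$y) & (test$y <= pred95$predictions[, 2]))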

Generic prediction intervals

There is a generic method that I use from time to time that works with any regression method. It is based on two models: the first models the conditional mean; the second models the absolute residuals of the first and, as such, provides a model for the conditional standard deviation of the response distribution. The prediction interval is then constructed in the same way as in your example. Similar to the quantile approach above, it ignores model uncertainty. A second issue is that the loss function used in the second model is rather arbitrary.

library(ranger)

friedman <- function(n) {
  X <- matrix(runif(n * 6), nrow = n)
  data.frame(
    X,
    y = 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
      10 * X[, 4] + 5 * X[, 5] + rnorm(n)
  )
}

set.seed(42)

n <- 10^5
training <- friedman(n)

n_tst <- 10^5
test <- friedman(n_tst)

rf_mean <- ranger(y ~ ., data = training)

# Model the absolute out-of-bag residuals of the first forest
# as a proxy for the conditional standard deviation
rf_sd <- ranger(abs(rf_mean$predictions - training$y) ~ ., data = training)

pred_mean <- predict(rf_mean, data = test)$predictions
pred_sd <- predict(rf_sd, data = test)$predictions

Lower <- pred_mean - 1.96 * pred_sd
Upper <- pred_mean + 1.96 * pred_sd

mean((Lower <= test$y) & (test$y <= Upper))  # 89%

Interestingly, the coverage of this method is very similar to that of the quantile random forest, which might somewhat indicate that Friedman's data is easy to predict?
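
A possible partial explanation for the second method's ~89%: the second forest estimates the conditional mean absolute residual rather than the conditional standard deviation, and for Gaussian noise $\mathbb{E}|\epsilon| = \sigma\sqrt{2/\pi} \approx 0.80\,\sigma$. So even with a perfect mean model, the $\pm 1.96 \cdot$ `pred_sd` interval behaves like $\pm 1.96\sqrt{2/\pi}\,\sigma$, with expected coverage

2 * pnorm(1.96 * sqrt(2 / pi)) - 1  # ~0.882

which is close to the 89% observed.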

Michael M
  • Hi @Michael M. Thank you very much for your interest in the problem and all the effort you put into your answer. I agree that the statement of the question may be a little confusing, and I'm aware that mathematically what they are really constructing is a confidence interval for $\mathbb{E}[Y\mid X=x]$, while a genuine predictive interval would be an $I_x$ with the property $\Pr(Y\in I_x\mid X=x)=1-\alpha$. – Zen Jan 29 '22 at 22:06
  • Do you have an intuition on why the genuine predictive interval you constructed using Meinshausen's QRF has the wrong coverage even with the gigantic training sample size ($n = 100{,}000$) you used? Thanks for the discussion! – Zen Jan 29 '22 at 23:45
  • Two fantastic comments! Usually, prediction intervals based on empirical quantiles are too narrow, as they ignore the uncertainty of the model. For large samples and quite typical feature values, the approximation should be good enough in practice. Now, in the example provided above, the coverage goes in the wrong direction, which I don't understand right now. The conditional distribution of $y$ is symmetric by construction and without huge outliers, so I would have expected quite good coverage. – Michael M Jan 30 '22 at 07:14
  • @Zen I added a second approach and it also gives a lower coverage. So we have three different approaches that all give the same wrong answer... I am starting to think it might be a property of the synthetic dataset, which is too easy to predict? – Michael M Jan 30 '22 at 07:30
  • Hi @Michael M. Thank you very much for your insights on this subject. – Zen Jan 30 '22 at 12:52

Not a direct answer (unless there is a mix-up of confidence vs. prediction intervals here), but pulling out the quantiles from the individual trees in your example does produce approximately correct coverage (actually a bit better coverage than expected) for individual-level prediction intervals.

library(ranger)

friedman <- function(n, p = 6) {
    X <- matrix(runif(n * p), nrow = n, ncol = p)
    y <- 10 * sin(pi * X[, 1] * X[, 2]) + 20 * (X[, 3] - 0.5)^2 +
        10 * X[, 4] + 5 * X[, 5] + rnorm(n)
    # as.data.frame() silently ignores a col.names argument, so set names explicitly
    setNames(as.data.frame(cbind(y, X)), c("y", paste0("x_", 1:p)))
}

set.seed(10)

# Setting smaller to be a smidge
# more memory safe on my machine
n <- 10^3
training <- friedman(n)

n_tst <- 10^4
test <- friedman(n_tst)

rf <- ranger(y ~ ., data = training, num.trees = 500, keep.inbag = FALSE)

# Generating prediction intervals from the trees directly seems to
# work just fine
pred <- predict(rf, data = test, predict.all = TRUE)
lower <- apply(pred$predictions, 1, quantile, 0.05)  # 90% interval
upper <- apply(pred$predictions, 1, quantile, 0.95)
mean((lower <= test$y) & (test$y <= upper))
# [1] 0.9655 on my machine, expected 90%

You get these for free with random forests. For very extreme quantiles you will need to increase the number of trees, but in my experience across projects, it is typically very difficult to get good coverage for extreme percentiles in real-data examples anyway.
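
For illustration, the same tree-quantile construction at a more extreme level (a sketch reusing the objects above, with more trees so the tails of the per-tree prediction distribution are better resolved; results will vary):

rf_big <- ranger(y ~ ., data = training, num.trees = 2000)
pred_big <- predict(rf_big, data = test, predict.all = TRUE)
lower99 <- apply(pred_big$predictions, 1, quantile, 0.005)  # 99% interval
upper99 <- apply(pred_big$predictions, 1, quantile, 0.995)
mean((lower99 <= test$y) & (test$y <= upper99))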

Andy W
  • Hi @Andy W. Thank you very much for your answer. I had tried this approach some time ago. Exactly as in your example, I always got effective coverage higher than the nominal level, which seems to be an indication that the predictive intervals produced by this method are "too wide", in a sense. – Zen Jan 29 '22 at 17:55