
Scenario: I'm trying to build a random forest regressor to accelerate probing a large phase space. I'm using Python/scikit-learn to perform the regression, and I'm able to obtain a model with reasonably low cross-validation error on known data split into training/test sets.

Now I'd like to begin asking my model how confident it is in its predictions (I'm a bit confused about the difference between confidence intervals and prediction intervals). Currently I know how to return a point estimate $y$ given data $x$, and I'd like to get a measure of the uncertainty in that estimate as well. If the model is sufficiently uncertain, I'd like it to let me know so that I can add that case to the training set.

My impression is that random forests are an ensemble method. We grow $N$ decision trees, and our predictor is then given by $$ y(x) = \sum_{i=1}^{N} w_i T_i\left(x\right), $$ where $w_i$ is some weight and $T_i(x)$ is the value predicted by the $i$-th tree.

Perhaps the simplest (and by no means the most promising) approach would be to take something like the variance of the per-tree predictions:

$$ \sigma^2(x) = \frac{1}{N}\sum_{i=1}^{N} \left(T_i(x) - y(x)\right)^2 $$
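For concreteness, here is a minimal sketch of what I mean in scikit-learn, with `make_regression` toy data standing in for my real phase-space samples and arbitrary hyperparameters. The fitted trees are exposed through the forest's `estimators_` attribute, and scikit-learn's forest prediction is the unweighted mean of the per-tree predictions (i.e. $w_i = 1/N$ in the notation above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Toy data standing in for the real phase-space samples.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Each fitted tree lives in forest.estimators_; the forest prediction is the
# unweighted mean of the per-tree predictions (w_i = 1/N above).
per_tree_pred = np.stack([tree.predict(X_test) for tree in forest.estimators_])

ensemble_mean = per_tree_pred.mean(axis=0)   # agrees with forest.predict(X_test)
per_point_std = per_tree_pred.std(axis=0)    # the "sigma" from the formula above
```

Is something like `per_point_std` a legitimate uncertainty estimate, or does it only measure tree-to-tree disagreement?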

This guide seems to suggest we can do quantile analysis using each tree as an observation. Unfortunately I'm not sure I follow that logic, because it seems to me that, even for data points that were in the training set, there should be some variance among the trees.
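Here is roughly how I read the guide's suggestion, again on toy `make_regression` data rather than my real problem (so only a sketch of the idea, not a claim that these percentiles form a calibrated interval):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Treat each tree's prediction as one "observation" and take percentiles.
per_tree_pred = np.stack([tree.predict(X) for tree in forest.estimators_])
lower = np.percentile(per_tree_pred, 5, axis=0)   # crude 90% band
upper = np.percentile(per_tree_pred, 95, axis=0)

# Even on training points the trees disagree, because each tree is grown on a
# bootstrap resample (and random feature subsets), so no single tree has seen
# every training point.
print((upper - lower).mean())
```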

This post seems to mention the flaws in the above methods, but unfortunately I can't really follow what they suggest we do instead (my skills in R aren't what they could be). Can someone weigh in on whether this is the train of thought I should be following, and perhaps help me understand what's going on? I also can't see where assumptions about the sample are supposed to come in. Seeing similar code in Python would be really helpful to me.

This paper also seems like it might be useful, but the terms/notation are entirely outside my specialty, and it's essentially unreadable to me right now.

While I've worked in statistics on an application level, it's been a while since my last formal class. So part of the barrier to entry for me is terminology/symbol use.

Edited to provide more detail.

Andrew
  • I've been wondering about this as well. I don't know what is standard practice. One thing I've thought about is doing a k-fold cross-validation and using the predicted values from each of the k folds to create a distribution for the predicted values. From there, calculate the quantiles of that distribution. Kind of like doing a bootstrap. I think that approach would be slightly different from the guide you mentioned. – Michael Webb Sep 22 '17 at 14:51
  • Thanks for an additional suggestion. I'm currently trying to play around with the quantile approach I linked to see if that works as I'd hoped (I can't say I understand the theory terribly well, but it makes verifiable/falsifiable predictions). Doing a k-fold cross validation the way you mentioned would (I think) be similar in principle to running different experiments designed to reach the same conclusion - a totally reasonable approach. However, due to a quirk in my current setup (that may be an issue long-run) my training set is rather static. I'll see if I can work around that. – Andrew Sep 22 '17 at 22:23
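A rough Python sketch of the k-fold idea from the comments above (my own reading of the suggestion, on toy data; the fold count, forest size, and percentile levels are arbitrary, and with only k = 5 refits the per-point "distribution" is of course very coarse):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_query = X[:5]   # a few example points whose uncertainty we want to gauge

# Refit the forest on each of k training folds and collect the k predictions
# per query point as a small distribution.
fold_preds = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_preds.append(model.predict(X_query))

fold_preds = np.stack(fold_preds)                     # shape (k, n_query)
spread = fold_preds.std(axis=0)                       # fold-to-fold variability
lo, hi = np.percentile(fold_preds, [5, 95], axis=0)   # very coarse with k = 5
```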

2 Answers


As far as I know, the uncertainty of random forest predictions can be estimated using several approaches. One of them is the quantile regression forests method (Meinshausen, 2006), which estimates prediction intervals. Other methods include the U-statistics approach of Mentch & Hooker (2016) and the Monte Carlo simulation approach of Coulston (2016).
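Since the question asked for Python, here is a rough sketch of the core idea behind quantile regression forests using plain scikit-learn: for each query point, find the training responses that land in the same leaf, tree by tree, weight them by leaf size, and take quantiles of that weighted sample. This is a simplified illustration on toy data, not a reference implementation of Meinshausen's method:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
forest.fit(X_train, y_train)

# apply() returns the leaf index each sample falls into, per tree:
# shape (n_samples, n_trees).
train_leaves = forest.apply(X_train)
test_leaves = forest.apply(X_test)

def weighted_quantile(values, weights, q):
    """Quantile of the discrete distribution defined by (values, weights)."""
    keep = weights > 0
    values, weights = values[keep], weights[keep]
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    return np.interp(q, cdf, values)

lower, upper = [], []
for leaves in test_leaves:                        # one test point at a time
    # Meinshausen-style weights: in each tree, training points sharing the
    # test point's leaf get weight 1 / (leaf size); then average over trees.
    w = np.zeros(len(y_train))
    for t in range(train_leaves.shape[1]):
        in_leaf = train_leaves[:, t] == leaves[t]
        w[in_leaf] += 1.0 / in_leaf.sum()
    w /= train_leaves.shape[1]
    lower.append(weighted_quantile(y_train, w, 0.05))
    upper.append(weighted_quantile(y_train, w, 0.95))
```

A dedicated implementation will be faster and more careful about the weighting; this only shows the mechanics of the method.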

JonathanV

The problem of constructing prediction intervals for random forest predictions has been addressed in the following paper:

Zhang, Haozhe, Joshua Zimmerman, Dan Nettleton, and Daniel J. Nordman. "Random Forest Prediction Intervals." The American Statistician, 2019.

The R package "rfinterval" is its implementation, available on CRAN.

Installation

To install the R package rfinterval:

```r
#install.packages("devtools")
#devtools::install_github(repo="haozhestat/rfinterval")
install.packages("rfinterval")
library(rfinterval)
?rfinterval
```

Usage

Quickstart:

```r
train_data <- sim_data(n = 1000, p = 10)
test_data <- sim_data(n = 1000, p = 10)

output <- rfinterval(y ~ ., train_data = train_data, test_data = test_data,
                     method = c("oob", "split-conformal", "quantreg"),
                     symmetry = TRUE, alpha = 0.1)

### print the marginal coverage of OOB prediction interval
mean(output$oob_interval$lo < test_data$y & output$oob_interval$up > test_data$y)

### print the marginal coverage of Split-conformal prediction interval
mean(output$sc_interval$lo < test_data$y & output$sc_interval$up > test_data$y)

### print the marginal coverage of Quantile regression forest prediction interval
mean(output$quantreg_interval$lo < test_data$y & output$quantreg_interval$up > test_data$y)
```
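Since the original question was about Python/scikit-learn: the out-of-bag ("oob") interval above can be roughly mimicked without leaving scikit-learn by taking quantiles of the out-of-bag residuals and centering them on the test predictions. This is only a sketch of the idea on toy data; the paper and the rfinterval package refine it considerably:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# Out-of-bag residuals: each training point is predicted only by the trees
# whose bootstrap sample did not contain it.
oob_resid = y_train - forest.oob_prediction_

# Crude 90% interval: center the 5%/95% residual quantiles on the predictions.
lo_q, hi_q = np.quantile(oob_resid, [0.05, 0.95])
pred = forest.predict(X_test)
lower, upper = pred + lo_q, pred + hi_q

coverage = np.mean((lower < y_test) & (y_test < upper))
print(f"empirical coverage of the nominal 90% interval: {coverage:.3f}")
```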
xiaolongmao
  • Welcome to the site, @xiaolongmao. You may want to take our [tour]. Please do not post identical answers to multiple threads. Try to customize your answers to the specific question on each thread. If you have a case where you really believe that an identical answer completely answers the question, that implies the question is a duplicate. When you reach 50 reputation, you can post a comment to the OP. In the interim, you can flag the Q for closing as a duplicate. – gung - Reinstate Monica Aug 17 '19 at 01:08