How to find prediction interval when there is >1 response variable

Question

I have been working on a data set with more than one response variable (multi-output): 17 predictor variables and 2 output variables. I fitted the model with scikit-learn's RandomForestRegressor and made predictions, but I am stuck on finding prediction intervals for new data points. I went through the paper by Wager et al. (2014) and papers on quantile regression forests, and I found a Python library named forestci which implements Wager et al. (2014) and computes these intervals. But these papers and libraries don't discuss how to find such intervals for multiple response variables. I want to know how prediction intervals can be extended to multi-output variables, so that I can quantify the uncertainty of my predicted responses.

S. Wager, T. Hastie, B. Efron. "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife", Journal of Machine Learning Research, vol. 15, pp. 1625-1651, 2014.

score 2 · Answer 1 · answered Jun 26 '19 at 07:04

It sounds like you are not interested in marginal prediction intervals, computed separately for each of your two outcomes.

You will need a two-dimensional analogue of a prediction interval, i.e., a prediction area. (Not an established term, I made that up right now.) Just as there are different prediction intervals (symmetric, shortest, one-sided, Highest Density Regions), there are many different possible "prediction areas". One easily explained one would be a smallest prediction ellipse.

Since you are using nonparametric approaches, the simplest way would be to sample many bivariate points from your predictive bivariate density, then calculate a smallest ellipse covering a specific portion of those samples. This and this may be helpful.
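As a minimal sketch of the sampling-and-ellipse idea, assume the bivariate predictive samples are already available as an (n, 2) array. The code below fits an ellipse shaped by the sample covariance and scales it so that it covers the requested share of the samples. Note that this is a simple Mahalanobis-distance ellipse, not a true smallest-area ellipse, and the function names `prediction_ellipse` and `in_ellipse` are made up for illustration. The samples here are synthetic; with a scikit-learn forest, one crude source would be the per-tree predictions for the new point (via `rf.estimators_`), though their spread reflects only the disagreement between trees, not the full predictive density.

```python
import numpy as np


def prediction_ellipse(samples, coverage=0.9):
    """Covariance-shaped ellipse covering `coverage` of the samples.

    Returns (center, cov, r2). A point x lies inside the ellipse when
    (x - center)^T cov^{-1} (x - center) <= r2.
    """
    samples = np.asarray(samples, dtype=float)
    center = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    diff = samples - center
    # Squared Mahalanobis distance of every sample from the center.
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    # Scale the ellipse so the requested fraction of samples falls inside.
    r2 = np.quantile(d2, coverage)
    return center, cov, r2


def in_ellipse(points, center, cov, r2):
    """Boolean mask: which points fall inside the ellipse."""
    diff = np.atleast_2d(points) - center
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return d2 <= r2


# Synthetic stand-in for samples from the bivariate predictive density.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(
    mean=[1.0, -2.0], cov=[[1.0, 0.6], [0.6, 2.0]], size=2000
)

center, cov, r2 = prediction_ellipse(samples, coverage=0.9)
inside = in_ellipse(samples, center, cov, r2)
print("empirical coverage:", inside.mean())
```

A genuinely smallest ellipse would need a dedicated minimum-volume search; the covariance-based version is only a convenient starting point and works best when the samples look roughly elliptical.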

An alternative to the smallest prediction ellipse would be a two-dimensional Highest Density Region. Hyndman (1996), which I linked to in this earlier answer on HDRs, goes into this.
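A rough sketch of a sample-based two-dimensional HDR, under the assumption that a kernel density estimate of the predictive samples is acceptable: estimate the density at every sample, then take the (1 - coverage) quantile of those density values as the cutoff, so that the region consists of all points whose estimated density exceeds it. The helper names `kde_at` and `hdr_threshold`, the product-Gaussian kernel, and the Scott-style bandwidth are illustrative choices, not something prescribed by the sources above.

```python
import numpy as np


def kde_at(points, samples, bandwidth):
    """Evaluate a product-Gaussian KDE (one bandwidth per dimension)."""
    points = np.atleast_2d(points)
    z = (points[:, None, :] - samples[None, :, :]) / bandwidth
    kernel = np.exp(-0.5 * (z ** 2).sum(axis=2))
    norm = (2.0 * np.pi) ** (samples.shape[1] / 2.0) * np.prod(bandwidth)
    return kernel.mean(axis=1) / norm


def hdr_threshold(samples, coverage=0.9):
    """Density cutoff such that about `coverage` of samples lie above it."""
    n, d = samples.shape
    # Scott-style rule of thumb; an assumption, tune for real data.
    bandwidth = samples.std(axis=0, ddof=1) * n ** (-1.0 / (d + 4))
    dens = kde_at(samples, samples, bandwidth)
    return np.quantile(dens, 1.0 - coverage), bandwidth


# Synthetic bimodal stand-in for the bivariate predictive samples.
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))
samples[:, 0] += rng.choice([-5.0, 5.0], size=1000)

threshold, bandwidth = hdr_threshold(samples, coverage=0.9)
in_hdr = kde_at(samples, samples, bandwidth) >= threshold
midpoint_in_hdr = kde_at([0.0, 0.0], samples, bandwidth)[0] >= threshold

print("share of samples inside the HDR:", in_hdr.mean())
print("midpoint between the modes inside the HDR:", midpoint_in_hdr)
```

Unlike an ellipse, the region defined this way can be non-convex or split into several pieces, which matters when the predictive density is multimodal, as in the synthetic example.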
