Bootstrapping estimates of out-of-sample error

Question

I know how to use bootstrap resampling to find confidence intervals for in-sample error or $R^2$:

# Bootstrap 95% CI for R-Squared
library(boot)
# function to obtain R-Squared from the data 
rsq <- function(formula, data, indices) {
  d <- data[indices,] # allows boot to select sample 
  fit <- lm(formula, data=d)
  return(summary(fit)$r.squared)
} 
# bootstrapping with 1000 replications 
results <- boot(data=mtcars, statistic=rsq, 
     R=1000, formula=mpg~wt+disp)

# view results
results 
plot(results)

# get 95% confidence interval 
boot.ci(results, type="bca")

But what if I want to estimate out-of-sample error (somewhat akin to cross-validation)? Could I fit a model to each bootstrap sample, use that model to predict on every other bootstrap sample, and then average the RMSE of those predictions?

This sounds somewhat akin to an older technique called CV5. You divide the sample into 5 parts, fit on parts 1-4, and use the 5th as a holdout. You then combine the error terms. The choice of 5 was arbitrary; this was often used where "split half" would have resulted in the model being fit on too few observations. Bootstrapping just allows you to do more variations than CV5, but then computers are more powerful now as well. — zbicyclist, Oct 06 '11 at 17:52
@zbicyclist: so is the bootstrap traditionally used to estimate "in-sample" statistics, while cross-validation is traditionally used to estimate "out-of-sample" statistics? — Zach, Oct 06 '11 at 18:11
Yes. I misread "model to each bootstrap sample, and then use that model to predict for each other bootstrap sample" as "model each bootstrap sample, and then use that model to predict the non-bootstrapped observations". I'd do it the second way, so there's no contamination between the observations you model and the ones you predict. — zbicyclist, Oct 13 '11 at 13:53
This form of bootstrapping isn't guaranteed to work. If the true predictive ability of the model is zero, then you don't get asymptotic normality of the quantity being bootstrapped, and the bootstrap confidence intervals above are not valid confidence intervals. With close to zero predictive ability, in finite sample sizes you still get badly-miscalibrated intervals. — guest, Jan 24 '12 at 07:19
Look up the "leave-one-out bootstrap" in The Elements of Statistical Learning (ESL): https://web.stanford.edu/~hastie/ElemStatLearn/ — JohnRos, Sep 20 '20 at 05:54
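
The variant discussed in the comments above (fit on each bootstrap sample, predict only the observations that were not drawn) can be sketched in base R. This is a minimal illustration under assumptions that are not in the thread: the mtcars data, the model mpg ~ wt + disp from the question, and B = 200 resamples. It is not the exact estimator from any particular textbook.

```r
# Out-of-bag (leave-one-out) bootstrap estimate of prediction RMSE.
# Assumptions: mtcars data, the model mpg ~ wt + disp, and B = 200.
set.seed(1)
n <- nrow(mtcars)
B <- 200

# sq_err[i, b] holds the squared error for row i when it was left out of resample b
sq_err <- matrix(NA_real_, nrow = n, ncol = B)

for (b in seq_len(B)) {
  in_bag <- sample.int(n, n, replace = TRUE)   # bootstrap training rows
  oob    <- setdiff(seq_len(n), in_bag)        # rows not drawn this time
  if (length(oob) == 0) next
  fit  <- lm(mpg ~ wt + disp, data = mtcars[in_bag, ])
  pred <- predict(fit, newdata = mtcars[oob, ])
  sq_err[oob, b] <- (mtcars$mpg[oob] - pred)^2
}

# Average each row's out-of-bag squared errors first, then average over rows
per_obs_mse <- rowMeans(sq_err, na.rm = TRUE)
oob_rmse    <- sqrt(mean(per_obs_mse, na.rm = TRUE))

# Apparent (in-sample) RMSE of the model fit to all rows, for comparison
full_fit      <- lm(mpg ~ wt + disp, data = mtcars)
apparent_rmse <- sqrt(mean(residuals(full_fit)^2))

c(apparent = apparent_rmse, out_of_bag = oob_rmse)
```

Because each model is scored only on rows it never saw, there is no overlap between the fitting and prediction data within a replicate.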

score 10 · Answer 1 · answered May 15 '13 at 14:16

This calls for the standard Efron-Gong "optimism" bootstrap. In R you can do this:

require(rms)
# Allow age to interact with sex and age and BP to have nonlinear effects
# using restricted cubic splines (5 and 4 knots)
f <- ols(y ~ rcs(age,5)*sex + rcs(blood.pressure,4), x=TRUE, y=TRUE)
validate(f, B=300)

This will give you the bootstrap overfitting-corrected estimate of $R^2$, MSE, and other indexes. To get a bootstrap overfitting-corrected calibration curve (estimate of relationship between $\hat{Y}$ and $Y$), run plot(calibrate(f, B=300)).

This type of bootstrap estimates the likely future performance of the final model on new subjects from the same "stream" of subjects. Some observations are duplicated, triplicated, etc., and "training" and "test" datasets overlap during the bootstrap. The bootstrap provides a highly competitive estimate of future performance, along the lines of 100 repeats of 10-fold cross-validation.
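
For readers who want to see the mechanics, the optimism correction can be written out by hand. This is a sketch of the general idea, not the rms implementation; the mtcars data, the model mpg ~ wt + disp, the helper name r2, and B = 300 are all assumptions made for illustration.

```r
# Hand-rolled optimism bootstrap for R^2 (a sketch, not the rms implementation).
# Assumptions: mtcars data, the model mpg ~ wt + disp, and B = 300.
set.seed(1)
B    <- 300
form <- mpg ~ wt + disp

# R^2 of a fitted model's predictions on an arbitrary data set (hypothetical helper)
r2 <- function(fit, data) {
  y    <- data$mpg
  pred <- predict(fit, newdata = data)
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Apparent performance: fit and evaluate on the full original sample
full_fit <- lm(form, data = mtcars)
apparent <- r2(full_fit, mtcars)

optimism <- replicate(B, {
  boot_data <- mtcars[sample.int(nrow(mtcars), replace = TRUE), ]
  boot_fit  <- lm(form, data = boot_data)
  # performance on the resample it was fit to, minus performance on the original sample
  r2(boot_fit, boot_data) - r2(boot_fit, mtcars)
})

# Subtract the average optimism from the apparent performance
corrected <- apparent - mean(optimism)
c(apparent = apparent, optimism = mean(optimism), corrected = corrected)
```

Any modelling steps that used the data (variable selection, tuning) would need to be repeated inside the replicate loop for the correction to reflect them; this sketch has none.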

Frank Harrell


Dr. Harrell, Would you suggest this method only for "smaller" sample sizes where there is not the luxury of being able to sample out a test set? My process for what Zach is asking is to sample out the test set from the beginning and use the training set for model selection, then fit the full model on the training set. In order to calculate a confidence interval around the out-of-sample performance, I will use an ordinary nonparametric bootstrap on the test set and apply the final model to every bootstrap sample - thus creating a performance index distribution. – B_Miner May 15 '13 at 15:16
No, I would use it for all sample sizes. Single training/test splits are often unreliable unless $n > 20000$. – Frank Harrell May 16 '13 at 13:47
@Frank Harrell Efron and Gong described the optimism bootstrap for classification problems with 0-1 loss. Though it seems straightforward for regression and quadratic loss, somehow I could not find it in the literature. Can you suggest a reference? Thank you. – AlexGenkin Jul 14 '14 at 16:40
See http://www.citeulike.org/user/harrelfe/article/13264256 and http://www.citeulike.org/user/harrelfe/article/13264735 – Frank Harrell Jul 14 '14 at 16:55
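
The test-set approach B_Miner describes in the comments above (hold out a test set, fit once, then bootstrap the test set) might look like the following. It is only a sketch: the mtcars data, the model mpg ~ wt + disp, the 20/12 split, and 1000 resamples are assumptions, and the reply above cautions that a single split is often unreliable at small sample sizes.

```r
# Bootstrap interval for held-out RMSE: the model is fit once and never refit.
# Assumptions: mtcars data, mpg ~ wt + disp, a single 20/12 train/test split.
set.seed(1)
train_rows <- sample.int(nrow(mtcars), 20)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

final_fit <- lm(mpg ~ wt + disp, data = train)
test_err  <- test$mpg - predict(final_fit, newdata = test)

# Resample the test-set errors with replacement and recompute RMSE each time
boot_rmse <- replicate(1000, {
  idx <- sample.int(length(test_err), replace = TRUE)
  sqrt(mean(test_err[idx]^2))
})

sqrt(mean(test_err^2))                  # point estimate on the test set
quantile(boot_rmse, c(0.025, 0.975))    # percentile interval
```

The interval reflects only the sampling variability of the test set given this one fitted model; it does not account for variability in the fit itself.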

Peter Ellis · Accepted Answer · 2012-01-25T18:55:41.570

The short answer, if I understand the question, is "no". Out-of-sample error is, by definition, outside your sample, and no bootstrapping or other analytical effort with your sample can calculate it.

In answer to your comment on whether the bootstrap can be used in checking a model with data outside a training set: two possible interpretations.

It would be fine, and absolutely standard, to fit a model on your training set with traditional methods and then use bootstrapping on the training set to check for things like distribution of your estimators, etc. Then use your final model from that training set to test against the test set.

It would be possible to do a bootstrap-like procedure that involves a loop around:

selecting a subset of the whole sample as your training set
fitting a model to that training set
comparing that model against the testing set made up of the remaining data, and generating a test statistic that says how well the model from the training set performs on the test set.

You would then consider the results of doing that many times. Certainly, it would give you some insight into the robustness of your train/test process. It would reassure you that the particular result you got was not just due to the chance of what ended up in the test set in your one split.
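
The loop described above can be sketched in a few lines of R. The mtcars data, the model mpg ~ wt + disp, the 70/30 split, and the 200 repeats are all assumptions chosen for illustration, and RMSE stands in for "some kind of test statistic".

```r
# Repeated random train/test splits (a robustness check, not a bootstrap).
# Assumptions: mtcars data, mpg ~ wt + disp, a 70/30 split, and 200 repeats.
set.seed(1)
n       <- nrow(mtcars)
n_train <- floor(0.7 * n)

split_rmse <- replicate(200, {
  train_rows <- sample.int(n, n_train)              # drawn without replacement
  fit  <- lm(mpg ~ wt + disp, data = mtcars[train_rows, ])
  test <- mtcars[-train_rows, ]
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
})

summary(split_rmse)                     # how much the test RMSE moves between splits
quantile(split_rmse, c(0.025, 0.975))   # spread across splits, not a formal confidence interval
```

Sampling without replacement keeps each training and test set disjoint, which is what distinguishes this from a bootstrap resample.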

However, although it's difficult to say exactly why, there seems to me to be a philosophical clash between the idea of a testing/training division and the bootstrap. Perhaps if I thought of it not as a bootstrap but just as a robustness test of the train/test process, it would be OK...

I was wondering if there was a "bootstrap" equivalent to cross-validation, where each observation may be in the training set more than once, but error is still aggregated out-of-sample. — Zach, Jan 23 '12 at 19:27
I must be confused by the terminology, sorry. When you say error is aggregated out-of-sample, do you mean over the whole sample, including those points which are not in the training set? I'd think of that as whole-of-sample, as you still don't have any points from outside your sample (by definition...). — Peter Ellis, Jan 25 '12 at 18:44
Sorry, by out-of-sample I meant "out of the sample selected for each bootstrap replicate." Sort of like what "out of bag" means for random forests. — Zach, Jan 25 '12 at 19:05

score 1 · Answer 3 · answered May 15 '13 at 13:59

The bootstrap is neither an in-sample nor an out-of-sample test.

Consider the bootstrap logic:

1. A statistic is computed in the original sample.
2. A resample is constructed by sampling from the sample with replacement (this resample is treated as a possible sample from the same population).
3. The same statistic is computed in the resample.
4. Steps 2 and 3 are repeated, and the distribution of the resulting statistics is used to construct a confidence interval.

Now translate this to the notion of out-of-sample testing, where you estimate a prediction model based on the original sample and then test out-of-sample. The out-of-sample sample should be any sample other than the original sample drawn from the same population.

Resampling with replacement provides you with such a sample, or indeed many such samples should you so wish. Now you can use the original model estimates from your original prediction model to predict outcomes in the new resample(s).

You can now compute a model-fit statistic to see whether the model explains a similar share of the variation in the original sample and in each of the resamples. If the results are all similar, overfitting is not an issue. If the results in the resamples are (significantly) worse than the model fit in the original sample, you have evidence of overfitting.

When comparing different candidate models, you can select the one with the best (average) model fit in the resamples. More advanced strategies also take the variance of the model fit into account, but add little in my opinion.

Best wishes
