"the goal is to build a predictive model."

I read this as: a model that is then actually used for prediction, and we need to know the performance of exactly that model.*
Independent Test Set or Hold Out Testing
Now suppose your setup is the following: a training set is used to build your model, and once that model is finalized, its performance is evaluated with a properly independent test set.
In that case, as the model is fixed, we need to account only for the variance uncertainty due to the limited number of tested cases - as always, if the performance estimate is based on measuring more cases, the uncertainty will be lower.
Thus, bootstrap your figure of merit from the test results.
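As a concrete illustration, here is a minimal sketch (Python/numpy; the data, the 0.85 error-free fraction, and the choice of accuracy as figure of merit are placeholders of mine, and the percentile interval is only one of several ways to turn the bootstrap distribution into an interval):

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder hold-out test results of the *fixed* final model:
# y_true are the reference labels, y_pred the model's predictions.
y_true = rng.integers(0, 2, size=200)
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)

n_test = len(y_true)
n_boot = 2000

# Bootstrap the figure of merit (here: accuracy) by resampling test cases with replacement.
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n_test, size=n_test)
    boot_acc[b] = np.mean(y_true[idx] == y_pred[idx])

print("accuracy:", np.mean(y_true == y_pred))
print("95 % percentile interval:", np.percentile(boot_acc, [2.5, 97.5]))
```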
Figures of merit that are proportions (0/1 loss, e.g. accuracy, precision, recall, sensitivity, ...) follow a binomial distribution, so you can also directly calculate confidence intervals from the binomial distribution. This is particularly useful because you can do that beforehand as a back-of-the-envelope calculation to check whether your experiment can possibly yield a sufficiently narrow confidence interval for your figure of merit to be of practical use.
We've outlined such approaches in: Beleites, C. and Neugebauer, U. and Bocklitz, T. and Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33.
DOI: 10.1016/j.aca.2012.11.007
accepted manuscript on arXiv: 1211.1323
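For such a back-of-the-envelope check, a minimal sketch (the expected sensitivity of 0.90 and the candidate sample sizes are assumptions for illustration; `proportion_confint` from statsmodels provides the standard binomial interval methods):

```python
from statsmodels.stats.proportion import proportion_confint

# How wide would a 95 % confidence interval for a proportion-type figure of merit
# (e.g. sensitivity) be for various test sample sizes, if the true value is about 0.90?
expected_sensitivity = 0.90  # assumed value for planning purposes

for n_test in (25, 50, 100, 250, 500):
    successes = round(expected_sensitivity * n_test)
    low, high = proportion_confint(successes, n_test, alpha=0.05, method="wilson")
    print(f"n = {n_test:4d}: 95 % Wilson CI [{low:.3f}, {high:.3f}], width {high - low:.3f}")
```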
Resampling validation: Cross Validation, Set Validation, Out-Of-Bootstrap & Co.
Resampling validation takes so-called surrogate models trained on a subset of the data at hand and tests them with the respective cases not used for that surrogate model's training. This is typically done for many surrogate models, and the test results are pooled and used as approximation for the performance of the final model which is trained with the same algorithm but on the whole data set.
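A minimal sketch of this setup (scikit-learn, with a placeholder data set and estimator; the particular modeling algorithm does not matter here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=150, n_features=20, random_state=0)

# Each fold yields one surrogate model, trained on a subset of the data
# and tested on the cases left out of that subset.
pooled_true, pooled_pred = [], []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    surrogate = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    pooled_true.append(y[test_idx])
    pooled_pred.append(surrogate.predict(X[test_idx]))

# The pooled test results approximate the performance of the final model,
# which is trained with the same algorithm on the whole data set.
pooled_true = np.concatenate(pooled_true)
pooled_pred = np.concatenate(pooled_pred)
print("pooled CV accuracy:", np.mean(pooled_true == pooled_pred))

final_model = LogisticRegression(max_iter=1000).fit(X, y)  # the model actually used later
```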
In this case, the situation is more complex as we have to take into account:
Bias: as the surrogate models are trained on smaller subsets, they are usually a bit worse than the final model; this is the root cause of the slight pessimistic bias of resampling validation. Your confidence interval will be shifted by this bias.
$k$-fold CV with not too small $k$ usually has low bias, while the bias of out-of-bootstrap can be more substantial, and I've seen the .632 bootstrap show an optimistic bias.
However, depending on the application question behind this, the bias may not be too bad: I've done a lot of work developing models for clinical diagnostic questions. In that case, I use cross validation (low but pessimistic bias) and can then say that my confidence interval will be a bit too conservative - which in this case is far more acceptable than a possibly overoptimistic estimate.
With some experience, you may be able to get an idea of the order of magnitude for your data.
We've studied this for small $n$, large $p$ situations that are typical in my field: Beleites, C.; Baumgartner, R.; Bowman, C.; Somorjai, R.; Steiner, G.; Salzer, R. & Sowa, M. G.: Variance reduction in estimating classification error using sparse datasets, Chemom Intell Lab Syst, 79, 91-100 (2005).
Variance uncertainty due to the limited number of tested cases: this is a bit more tricky now than above. As the results for all cases are pooled, you'd bootstrap test results across all cases; but those test results come from multiple surrogate models,
and there is also variance uncertainty due to possible model (in)stability, which is caused by the limited number of training cases and possibly by non-determinism in the training algorithm ("variance source 2b").
As this is important information in its own right, you may want to directly measure it with repeated/iterated cross validation or bootstrap-based resampling, see e.g. Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations, Anal Bioanal Chem, 2008, 390, 1261-1271.
DOI: 10.1007/s00216-007-1818-6
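One way to measure this directly is repeated/iterated cross validation; sketched below with scikit-learn's `RepeatedStratifiedKFold` on placeholder data. Each repetition tests every case exactly once, so the scatter between the repetition-wise results cannot come from the test cases and instead reflects model (in)stability:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=150, n_features=20, random_state=1)

n_splits = 5
cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=20, random_state=1)

# Collect one accuracy per repetition (each repetition = n_splits consecutive folds).
acc_per_repetition, fold_accs = [], []
for fold_no, (train_idx, test_idx) in enumerate(cv.split(X, y), start=1):
    surrogate = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    fold_accs.append(np.mean(surrogate.predict(X[test_idx]) == y[test_idx]))
    if fold_no % n_splits == 0:
        acc_per_repetition.append(np.mean(fold_accs))
        fold_accs = []

acc_per_repetition = np.asarray(acc_per_repetition)
# The same cases are tested in every repetition, only the surrogate models differ,
# so the spread between repetitions measures instability (variance source 2b).
print("mean accuracy:", acc_per_repetition.mean())
print("std. dev. between repetitions (instability):", acc_per_repetition.std(ddof=1))
```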
With such a repeated cross validation or out-of-bootstrap or any of its variants (.632, .632+), your raw test results include both relevant sources of variance. But what we want is the distribution of the figure of merit that pools both sources: $n_t$ tested independent cases and $n_b$ surrogate models.
While I've not quite finished thinking this through, at the moment I bootstrap both $n_b$ out of $n_b$ surrogate models and $n_t$ out of $n_t$ test cases to construct my distribution for the figure of merit.
(I presented a poster "C. Beleites & A. Krähmer: Cross-Validation Revisited: using Uncertainty Estimates to Improve Model Autotuning" about this a few weeks ago; please do not hesitate to email me [see profile] if you'd like a copy.)
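A minimal sketch of that double bootstrap, assuming the repeated test results are stored in a matrix `correct[i, j]` that records whether case $j$ was classified correctly in run/surrogate $i$ (the names, the matrix layout, and the placeholder numbers are mine, not taken from the poster):

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder results: correct[i, j] = 1 if test case j was predicted correctly
# by surrogate model / repetition i (n_b runs, n_t cases).
n_b, n_t = 20, 150
correct = (rng.random((n_b, n_t)) < 0.85).astype(float)

n_boot = 2000
boot_acc = np.empty(n_boot)
for b in range(n_boot):
    rows = rng.integers(0, n_b, size=n_b)   # resample n_b out of n_b surrogate models
    cols = rng.integers(0, n_t, size=n_t)   # resample n_t out of n_t test cases
    boot_acc[b] = correct[np.ix_(rows, cols)].mean()

print("accuracy:", correct.mean())
print("95 % interval covering both variance sources:", np.percentile(boot_acc, [2.5, 97.5]))
```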
* As opposed to: a model trained with this training algorithm on a data set of size $n$ drawn from this general population (but not on this particular data set). In that case, a proper estimate of the variance needs multiple data sets; resampling validation cannot estimate it, see
Bengio, Y. and Grandvalet, Y.: No Unbiased Estimator of the Variance of K-Fold Cross-Validation, Journal of Machine Learning Research, 2004, 5, 1089-1105.