I have a binary classification task where the goal is to build a classifier that outputs the probability that a patient will develop symptoms, based on several predictors. My dataset comprises 790 examples. I decided to do a stratified 80/20 holdout split (knowing that holding out 20% could lead to biased generalization performance estimates; see point 2 below).
I selected several learning algorithms to try (e.g., RF, XGBoost, and SVC) and performed nested CV on the training set to pick the best algorithm (say, the one with the highest value of a chosen metric, ROC AUC for example). Then I tuned the hyperparameters of the winning algorithm on the whole training set (as suggested here), with the same specification as the inner model-selection loop of the nested procedure. Finally, I fitted the final model on the whole training set with the best parameters found in this last tuning step.
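For clarity, here is a minimal sketch of this procedure in scikit-learn. The estimators, grids, fold counts, and `random_state` values are placeholders, and `X`, `y` stand for my predictors and labels; it is meant to show the structure, not my exact configuration.

```python
import numpy as np
from sklearn.model_selection import (StratifiedKFold, GridSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# X, y = my predictors and labels (placeholders)

# Stratified 80/20 holdout split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# Candidate algorithms with (placeholder) hyperparameter grids
candidates = {
    "rf": (RandomForestClassifier(), {"n_estimators": [200, 500]}),
    "svc": (SVC(probability=True), {"C": [0.1, 1, 10]}),
    # "xgb": (XGBClassifier(), {...}),  # analogous
}

# Nested CV on the training set: the outer loop estimates the performance
# of "tune algorithm A", and is used only to pick the best algorithm
nested_scores = {}
for name, (est, grid) in candidates.items():
    tuned = GridSearchCV(est, grid, scoring="roc_auc", cv=inner_cv)
    nested_scores[name] = cross_val_score(
        tuned, X_train, y_train, scoring="roc_auc", cv=outer_cv).mean()

best_name = max(nested_scores, key=nested_scores.get)
best_est, best_grid = candidates[best_name]

# Re-tune the winning algorithm on the whole training set with the same
# specification as the inner loop; refit=True (default) then fits the
# final model with the best parameters on the whole training set
final_search = GridSearchCV(best_est, best_grid, scoring="roc_auc", cv=inner_cv)
final_search.fit(X_train, y_train)
final_model = final_search.best_estimator_
```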
Now, regarding the estimation of generalization performance, I can predict on the holdout test set and compute metrics such as F2, ROC AUC, and so on. As far as I know, a random holdout split is not advised for small datasets (Harrell et al. 2016), since it can lead to suboptimal performance, essentially because there is less data available for training.
- Suppose we still perform the holdout split and I want to compute confidence intervals empirically (not with the normal-approximation method). I recently found the great work of Dr. Sebastian Raschka (github), where he compares different bootstrap strategies, one of which consists of bootstrapping the predictions on the test set (4.1.5), avoiding retraining the model every time. In 4.1.1, in the setup step for bootstrapping the training set, he says:
> If you don't tune your model on the training set, you actually don't need a test set for this approach
Since I actually have tuned my model on the training set via nested CV, I thought the only viable way to compute confidence intervals was the test-set method.
What are your thoughts?
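For reference, this is roughly what I mean by bootstrapping the test-set predictions: a percentile interval obtained by resampling the (label, predicted probability) pairs of the holdout set, with no retraining. This is my own sketch of the idea, not Dr. Raschka's code; `final_model` refers to the fitted model from the sketch above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_test_ci(y_true, y_prob, metric=roc_auc_score,
                      n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI from resampling the test-set (label, prediction) pairs;
    the model is never retrained."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        # skip degenerate resamples that contain a single class
        if len(np.unique(y_true[idx])) < 2:
            continue
        scores.append(metric(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.mean(scores), (lo, hi)

# y_prob = final_model.predict_proba(X_test)[:, 1]
# point_estimate, (lo, hi) = bootstrap_test_ci(y_test, y_prob)
```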
- Can I compute confidence intervals by bootstrapping the training set even after I have used it to tune the hyperparameters?
- In this case, wouldn't the test set be "useless", since the generalization estimate computed on the test set would be biased?
- What I could do to address this last point is to perform the holdout split several times, say 200 times, each time with a different seed; each time, train the final model obtained at the end of the procedure, compute predictions on that test set, and bootstrap those predictions. At the end I have 200 "bootstrap scores" that I can average. Would this procedure be correct? (I sketch it in code after this list.)
- Since my dataset is pretty small, can I avoid splitting into training and test sets at all? In other words, can I perform algorithm-and-model selection via nested CV and also do model evaluation and confidence-interval computation on the same dataset, without incurring biased estimates?
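Here is the repeated-holdout idea from the third bullet made concrete. It is only a sketch of what I have in mind: `fit_final_model` is a placeholder for the whole nested-CV-plus-tuning pipeline shown earlier, and the ROC AUC / repeat / bootstrap counts are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def repeated_holdout_scores(X, y, fit_final_model, n_repeats=200, n_boot=500):
    """For each seed: re-split, rebuild the final model on the new training
    set, then bootstrap its test-set predictions. Returns the overall mean
    and one averaged bootstrap score per repeat."""
    rng = np.random.default_rng(0)
    repeat_scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.20, stratify=y, random_state=seed)
        model = fit_final_model(X_tr, y_tr)   # nested CV + tuning, as above
        y_prob = model.predict_proba(X_te)[:, 1]
        y_te = np.asarray(y_te)
        boot = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_te), len(y_te))
            if len(np.unique(y_te[idx])) < 2:
                continue
            boot.append(roc_auc_score(y_te[idx], y_prob[idx]))
        repeat_scores.append(np.mean(boot))
    return np.mean(repeat_scores), repeat_scores
```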
If you can point me to any peer-reviewed literature, that would be awesome.