I am currently performing an analysis in which we are hoping to develop a risk score for a survival outcome using machine learning techniques. Currently, our process is as follows:
Split randomly into training and test data by ID
Use imputation to replace missing values in the training data. Save this imputation scheme and reuse it to replace missing values in the testing data.
Fit models on the training data
Create predictions on the testing data and compute Harrell's c-index (a rough sketch of this pipeline is below)
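For concreteness, here is a rough sketch of that pipeline in R (the data frame `dat`, the column names, and the simple mean imputation are just placeholders for our actual setup):

```r
library(glmnet)
library(survival)

set.seed(1)

## dat: placeholder data frame with columns id, time, status, and numeric predictors
ids       <- unique(dat$id)
train_ids <- sample(ids, size = floor(0.7 * length(ids)))
train <- dat[dat$id %in% train_ids, ]
test  <- dat[!dat$id %in% train_ids, ]

## stand-in for the real imputation: learn it on the training data, reuse it on the test data
pred_cols <- setdiff(names(dat), c("id", "time", "status"))
means <- sapply(train[pred_cols], mean, na.rm = TRUE)
for (v in pred_cols) {
  train[[v]][is.na(train[[v]])] <- means[v]
  test[[v]][is.na(test[[v]])]   <- means[v]
}

## fit a penalized Cox model on the training data
x_train <- as.matrix(train[pred_cols])
y_train <- Surv(train$time, train$status)  # recent glmnet versions accept a Surv response
fit <- cv.glmnet(x_train, y_train, family = "cox", alpha = 0.5)

## predict on the test data and compute Harrell's c-index
lp_test <- as.numeric(predict(fit, newx = as.matrix(test[pred_cols]), s = "lambda.min"))
c_test  <- concordance(Surv(test$time, test$status) ~ lp_test, reverse = TRUE)$concordance
```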
Now, it seems that the easiest way to get a 95% CI for the c-index is to use bootstrapping. However, because of the various steps of our analysis, I am not certain when in the process to perform the resampling. My thought is that this could happen either at the very beginning, before the data is even split into training and testing (sketched below), or directly after the imputation step. Is there a general rule that makes it clear when to perform the resampling in this case?
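To make the first option concrete, resampling before the split would mean treating the whole pipeline (split, imputation, fitting, c-index) as one function and rerunning it on every bootstrap sample. A rough sketch, assuming the steps above are wrapped in a hypothetical `run_pipeline()` that returns a c-index:

```r
## run_pipeline() is a hypothetical wrapper around split -> impute -> fit -> c-index
B <- 1000
c_boot <- replicate(B, {
  ids      <- unique(dat$id)
  boot_ids <- sample(ids, length(ids), replace = TRUE)
  ## resample whole individuals (by ID, with replacement) before any other step
  boot_dat <- do.call(rbind, lapply(boot_ids, function(i) dat[dat$id == i, ]))
  run_pipeline(boot_dat)
})
quantile(c_boot, c(0.025, 0.975))  # percentile 95% CI
```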
EDIT: A third option I've seen is resampling ONLY the test data, which avoids the computational expense of having to refit 1000 models (sketched below). This does seem like the only feasible option given my computational capacity, but I'm sure there are disadvantages to this approach...
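As I understand it, this would mean fitting the model (and the imputation) once and only resampling test individuals when recomputing the c-index, so it captures test-set sampling variability but not the variability from refitting. A rough sketch, reusing `test` and `lp_test` from above:

```r
## model fit once; only the test rows are resampled
B <- 1000
c_boot_test <- replicate(B, {
  idx <- sample(nrow(test), replace = TRUE)
  concordance(Surv(test$time[idx], test$status[idx]) ~ lp_test[idx],
              reverse = TRUE)$concordance
})
quantile(c_boot_test, c(0.025, 0.975))  # percentile 95% CI for the test-set c-index
```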
EDIT 2: There are fewer than 1000 events in the dataset and around 4000 individuals, so censoring is quite high. I've used glmnet to fit the models, and to my understanding it can use Harrell's c-index as the performance metric when cross-validating. So I'm not sure whether it would be a problem to define model performance with a different metric when the package already seems to support the c-index. I'm also not sure how best to compare something like the elastic net and a survival random forest in this context using something likelihood-based, which was part of the appeal of the c-index.
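If it matters, my understanding is that cv.glmnet only uses Harrell's C for Cox models when it is requested explicitly (the default cross-validation measure is partial-likelihood deviance). A minimal sketch, reusing `x_train` and `y_train` from above:

```r
## request Harrell's C as the cross-validation measure for the Cox family;
## without type.measure = "C", cv.glmnet defaults to partial-likelihood deviance
fit_c <- cv.glmnet(x_train, y_train, family = "cox",
                   alpha = 0.5, type.measure = "C")
plot(fit_c)  # CV curve of Harrell's C across lambda
```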