IMHO, set validation vs. cross validation of a model is mostly a matter of choice (and it doesn't really matter whether that model internally uses further splits during training).
IMHO, the main pitfall is: does your splitting procedure actually achieve statistical independence between the sets? But this is the same for set and cross validation.
The same reliability of results can be achieved with set or cross validation. In both cases you need to take more care than is customary - but in slightly different respects.
The underlying difficulty is that we have more than one source of uncertainty.
This has consequences wrt. your motivations:
- 1) the overall uncertainty is dominated by the largest source of uncertainty; reducing the others cannot bring the total uncertainty down to any practical extent.
- 3) hypothesis tests should be constructed in a way that takes care of these multiple sources of error (uncertainty).
For models that are to be used for production, i.e. where we want to estimate the generalization error of the model trained from the sample at hand, both set and cross validation have two sources of random error: the finite number of independent cases that are tested and model instability.
If instead you need the generalization error of a model trained on a sample of size $n$ drawn from the same population as the sample at hand, there is additional uncertainty which cannot be estimated by set or cross validation. If your "comparing sources of data" relates to this, you'll either have to live with an unknown source of uncertainty that hampers your ability to draw conclusions, or you need to take radically different approaches.
I'll continue with the "doable" task of estimating generalization error for a model to be used in production (which happens to be the main "mode" of my model validation work).
We can estimate both model instability and uncertainty due to the finite number of tested cases from set or from cross validation.
To get more reliable, i.e. less uncertain results wrt. test sample size error, we need to test as many cases as possible.
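For instance, if the figure of merit is a misclassification rate, the fraction $\hat p$ observed on $n_\text{test}$ independent test cases has (binomial) variance $\frac{p(1-p)}{n_\text{test}}$, so halving this uncertainty requires roughly four times as many tested cases.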
- Cross validation has the advantage of guaranteeing by construction that after $k$ folds each case has been tested exactly once.
- When randomly reserving a fraction of $\frac{1}{k}$ of the cases for testing in a set validation, you'll need more than $k$ iterations to have each available case tested at least once (unless you accidentally hit a series of splits that is equivalent to a run of $k$-fold cross validation). The short simulation sketch after this list illustrates the difference.
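A minimal simulation sketch of this coverage difference (sample size, $k$, and the number of simulated campaigns are made-up numbers, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cases, k = 80, 8     # hypothetical: 80 cases, test fraction 1/k
n_sim = 1000           # number of simulated set-validation campaigns

iterations_needed = []
for _ in range(n_sim):
    untested = np.ones(n_cases, dtype=bool)
    iterations = 0
    while untested.any():
        # one set-validation split: randomly reserve n_cases/k cases for testing
        test_idx = rng.choice(n_cases, size=n_cases // k, replace=False)
        untested[test_idx] = False
        iterations += 1
    iterations_needed.append(iterations)

print(f"{k}-fold CV tests every case after exactly {k} folds;")
print(f"random 1/{k} hold-out splits needed {np.mean(iterations_needed):.1f} "
      f"iterations on average (max {max(iterations_needed)}) to cover every case")
```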
Model instability is a bit different. Usually we want the models to be stable.
Again, we can improve our generalization error estimate by obtaining estimates from as many different surrogate models as possible. However, in practice we often only want to establish that model instability is not the dominating contribution to the overall uncertainty. This often doesn't need that many surrogate models and thus iterations or repetitions.
For cross validation, you should repeat, i.e. do several runs of $k$-fold CV. This is called repeated (aka iterated) $k$-fold cross validation and is sometimes described as $i \times k$-fold CV.
I often start with a small number (say, 3) of repetitions. After that, I have 3 estimates from different surrogate models for each of my cases. While an estimate of the variance due to model instability will still be very uncertain with just 3 repetitions, it is often sufficient to either establish that model instability is not a problem here (in which case further repetitions would not help much), or that the models are so unstable that I need to go back and change the training procedure anyway.
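As a sketch of what this looks like in code (model, data set, and numbers are placeholder assumptions; scikit-learn's `RepeatedKFold` is one way to set up $i \times k$-fold CV):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
n_repeats, k = 3, 8
cv = RepeatedKFold(n_splits=k, n_repeats=n_repeats, random_state=0)

# predictions[r, i] = prediction for case i by the surrogate model that did not
# see case i during training in repetition r
predictions = np.full((n_repeats, len(y)), np.nan)
for split_no, (train, test) in enumerate(cv.split(X)):
    repetition = split_no // k          # folds are grouped by repetition
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    predictions[repetition, test] = model.predict(X[test])

# every case now has n_repeats predictions from different surrogate models;
# disagreement between repetitions is a symptom of model instability
unstable = (predictions != predictions[0]).any(axis=0)
print(f"cases with unstable predictions: {unstable.mean():.1%}")
```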
In the intermediate case and in case one decides to stabilize by model aggregation, one can also decide to go on and do more repetitions.
For set validation, the tested cases will start to overlap between repetitions almost from the beginning. But again, you need to monitor that you have sufficiently many surrogate models to get both estimates with an (un)certainty that allows you to draw meaningful conclusions.
In the end, set validation may be a bit easier for fine-tuning the number of surrogate models (e.g. you can do exactly 50 repetitions with 1/8 of the data split off, whereas with 8-fold CV you have to decide between 6×8-fold, i.e. 48 surrogate models, and 7×8-fold = 56 surrogate models).
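A minimal sketch of how both resampling plans could be set up with scikit-learn (the splitter names are sklearn's; the sample size and split counts are placeholders):

```python
from sklearn.model_selection import RepeatedKFold, ShuffleSplit

n_cases = 160  # hypothetical sample size
X_dummy = [[0]] * n_cases

# set validation: any number of surrogate models, e.g. exactly 50,
# each time splitting off 1/8 of the data for testing
set_val = ShuffleSplit(n_splits=50, test_size=1 / 8, random_state=0)

# repeated 8-fold CV: the number of surrogate models is a multiple of 8
cv_48 = RepeatedKFold(n_splits=8, n_repeats=6, random_state=0)  # 48 surrogate models
cv_56 = RepeatedKFold(n_splits=8, n_repeats=7, random_state=0)  # 56 surrogate models

print(set_val.get_n_splits(X_dummy),
      cv_48.get_n_splits(X_dummy),
      cv_56.get_n_splits(X_dummy))
# -> 50 48 56
```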