
Generally, I've been taught that the test set should not be touched by any means, but suppose I have a pipeline that:

  1. First, split the data into training and test sets;
  2. Second, split the training set again, either as a train/dev split or via k-fold cross-validation, to do the hyperparameter search;
  3. Then, with the optimal hyperparameters obtained in step 2, fit on the full training set from step 1.

Now I would like to assess the "robustness" of this pipeline: should I bootstrap the entire train/test split N times and rerun the pipeline, or does it suffice to bootstrap directly from the test set?
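For concreteness, the three steps above could be sketched as follows (scikit-learn on synthetic data; the dataset, estimator, and hyperparameter grid are illustrative choices, not part of the question):

```python
# Sketch of the pipeline: outer train/test split, inner CV-based
# hyperparameter search, refit on the full training set, one test evaluation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Step 1: outer train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 2: inner k-fold cross-validation on the training set only.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

# Step 3: GridSearchCV refits with the best hyperparameters on the whole
# training set; the test set is touched exactly once, for the final score.
test_score = search.score(X_test, y_test)
print(test_score)
```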

1 Answer


You can (and should) bootstrap the train/test split (aka the "outer" split).

the test set should not be touched by any means

The point is that the test data needs to be independent of all data that entered the training process in any way. Hyperparameter tuning (selecting based on intermediate test results on the dev set) is part of the training process.

Not touching (looking into) the test set prevents one important way of violating independence: the test set entering the training process by driving decisions about the training.

However, independence can also be violated in other ways (e.g. by structure in the data), and sometimes setting the test set aside once (possibly even in a physical fashion) can be a good way to guard against such factors.

However, if you obtain the test set by (randomly) splitting the initially available data set, there is nothing that should keep you from repeating this process to assess the stability of the whole procedure: the single test set isn't any more independent of the training data than the repeated test sets are.
If you move to a cross validation procedure for this split, you get what is called nested cross validation, with the outer and inner split both being done via cross validation.
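A minimal nested cross-validation sketch of the kind described above (scikit-learn; the estimator and grid are illustrative assumptions): the inner CV selects hyperparameters, while the outer CV estimates the performance of the whole selection-plus-fitting procedure.

```python
# Nested CV: passing a GridSearchCV object to cross_val_score makes the
# hyperparameter search run independently inside each outer training fold.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)  # inner split
outer_scores = cross_val_score(inner, X, y, cv=5)                 # outer split
print(outer_scores.mean(), outer_scores.std())
```

The spread of `outer_scores` across folds is one direct measure of the pipeline's stability.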

Bootstrapping, or rather out-of-bootstrap testing, for the outer split draws the training set with replacement and is thus not simply a repeated version of the usual train/test split, which draws without replacement.
There is nothing wrong with using this approach, but the difference in the sampling procedure, which does affect the model, should be clearly stated.
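To illustrate the sampling difference, here is an out-of-bootstrap sketch (NumPy/scikit-learn; the estimator and the number of repetitions N = 20 are illustrative). For simplicity it omits the inner hyperparameter search, which in the full pipeline would run on each bootstrap training set:

```python
# Out-of-bootstrap testing: the training set is drawn WITH replacement;
# the rows never drawn (out-of-bag) serve as the test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)
n = len(y)

scores = []
for _ in range(20):                          # N bootstrap repetitions
    boot = rng.integers(0, n, size=n)        # indices drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag indices (~37% of rows)
    model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    scores.append(model.score(X[oob], y[oob]))

print(np.mean(scores), np.std(scores))       # spread reflects stability
```

Note that because of the replacement, each bootstrap training set contains duplicated rows, which is exactly the difference from a repeated without-replacement split that should be reported.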

cbeleites unhappy with SX