
I have a dataset that the authors of a paper have split into 10 folds. Each fold consists of a training set $D_{tr}$ and a test set $D_{te}$ (obviously, for each fold $D_{tr} \cup D_{te}$ gives the full dataset $D$).

I have built a nested-CV procedure that does the following: it takes the $D_{tr}$ of a fold, evaluates a set of models (searching for the best hyperparameters) using 5-fold CV, and returns the best set of hyperparameters ($P^*$). Next, I take this parameter set $P^*$, construct a model $m^*$, train it on $D_{tr}$, and evaluate the trained model on $D_{te}$. After repeating this for all 10 folds, I have 10 different models and 10 different test estimates (i.e. a model and a score for each fold). I average the scores of the 10 folds and note that down.

Procedurally, it looks like this:

    for fold i = 1 to 10:
        for each hyperparameter setting P_j  (j = 1, ..., N_p):
            evaluate model m built with P_j by 5-fold CV on D_tr   # -> iscore_j
        best = argmin_j {iscore_1, iscore_2, ..., iscore_{N_p}}
        P* = P_best
        build model m using P* and D_tr                            # call this m*
        evaluate the performance of m* on D_te                     # call this score_i

Report $\frac{1}{10}\sum_{i=1}^{10}{score_i}$

Note: The term 'build' means training.
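
For concreteness, here is a minimal sketch of this procedure in Python with scikit-learn. The `folds` list of (train, test) index pairs, the SVC pipeline and the parameter grid are placeholders standing in for the authors' folds and whatever model/grid is actually used; they are not part of the original setup.

    # Sketch of the nested-CV procedure above (assumptions: X, y hold the full
    # dataset D as numpy arrays, and `folds` reproduces the authors' 10 splits).
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}  # placeholder grid

    outer_scores = []
    for train_idx, test_idx in folds:                    # outer loop: the 10 given folds
        X_tr, y_tr = X[train_idx], y[train_idx]          # D_tr
        X_te, y_te = X[test_idx], y[test_idx]            # D_te

        # Inner 5-fold CV over the grid; refit=True (the default) retrains the
        # best model m* on all of D_tr afterwards.
        search = GridSearchCV(make_pipeline(StandardScaler(), SVC()),
                              param_grid, cv=5, n_jobs=-1)
        search.fit(X_tr, y_tr)

        outer_scores.append(search.score(X_te, y_te))    # evaluate m* on D_te

    print("nested-CV estimate:", np.mean(outer_scores))

Two small notes on the sketch: `GridSearchCV` maximizes its scoring function, so an error measure like the pseudocode's `iscore` (argmin) would be passed as one of the `neg_*` scorers. Also, `n_jobs=-1` already parallelizes the inner search over hyperparameter/fold combinations, and the outer loop over the 10 folds can be parallelized in the same spirit (e.g. with joblib), which attacks the runtime without changing the procedure.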

You can see that:

  • In this way, I never touch the test set of a fold during training or tuning, and thereby get an unbiased estimate.
  • Each fold gives a model and a score.
  • But this incurs a huge cost in computational resources and time.

Is this the way things should be done, or am I missing something that could make it less expensive?

Most closely related question: the procedure I have written above is what is described in the accepted answer to this question. It seems this is the standard nested-CV procedure, which makes me both happy and sad.

Coder
  • Any kind of feature engineering/selection that depends only on the predictors can be kept outside the CV without inducing bias. If an algorithm converges on an optimum from an initial best guess, you can use results from the whole dataset to improve that guess and speed things up. Plus, of course, CV lends itself naturally to parallelization. – Scortchi - Reinstate Monica Jun 26 '20 at 15:00
  • @Scortchi-ReinstateMonica: Would you mind explaining your points 1 and 2 a bit more? In my experience, feature generation based only on the predictors can also introduce severe bias (e.g. with PCA features) if it is not done within the train/test splits. As for initial settings, I'd caution that one needs to know that the particular algorithm in question gains convergence speed while the result does not change. – cbeleites unhappy with SX Jun 28 '20 at 15:11

1 Answer


This is a correct description of nested cross validation.

  • First of all, instead of the description with 3 loops (outer CV - hyperparameter loop - inner CV), you can say that you have verification (the outer CV loop) and training (hyperparameter optimization plus the inner CV loop):

    | outer CV     | hyperparameter loop | inner CV |
    | verification |                 training       |          
    

    Within this training, you can exchange the loops:

    | verification |       training                 |          
    | outer CV     | inner CV | hyperparameter loop |
    

    This rearrangement has the advantage that, while re-training is always needed for each split of the inner CV, the models for the different hyperparameter settings within one such split can often share computation, which is where the shortcuts below come in.

  • In general, there are several possibilities to speed up inner CV training:

    • Preprocessing steps that are done on each case on its own (i.e. row-wise operations) can be done once before (outside) the cross validation, without risking any bias.

    • The hyperparameter optimization may be considerably sped up depending on the particular characteristics of the training algorithm.

      • For some models like PCA or PLS, the results for lower-complexity models (fewer components/latent variables) can be extracted from the more complex model without refitting. Thus it is sufficient to train only for the highest number of components/latent variables, and the re-training for the less complex models is saved (see the PCA sketch after this list).

      • For other models like SVM, one may detect during the optimization hyperparameter regions that do not need to be evaluated since further increasing/decreasing hyperparameters will not lead to further changes in the model.

      • Some further shortcuts come at a certain risk of overfitting. In training, you can use your expert judgement to decide whether the speedup is worth the risk or not.
        (If the idea wasn't that bright, you'll see that in the verification results)

    • For some algorithms, model updates for removing and adding training cases are less computationally expensive than complete retraining. In particular, there may exist analytical formulations for leave-one-(row)-out (see the ridge regression sketch after this list).

  • IMHO, you can do whatever you want during training as long as the resulting models undergo an honest validation.
    Since the inner cross validation (hyperparameter tuning) is really part of the model training, you may use whatever heuristic you deem fit to speed up your model training. The result may not be "nested cross validation of plain vanilla $algorithm" any more - but IMHO that doesn't matter as long as the heuristic is good, i.e. you obtain decent models.

  • OTOH, the outer cross validation is for verifying the performance of the models obtained by the full (incl. inner CV) training strategy. Here, no shortcuts should be taken: any bias here would deprive you of the ability to judge whether your training and hyperparameter optimization strategy works well. Also, depending on the application an overoptimistic bias may have serious consequences.

    • Thus, no shortcuts should be used here, and the $r \cdot k_{outer}$ times higher computational effort of verifying the performance by $r$ repetitions of $k$-fold cross validation should simply be invested.
    • IMHO, it may even be worth putting the preprocessing steps mentioned above as "can be done before the cross validation" inside the cross validation, in order to make sure that no programming mistake causes a data leak.
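
To make the shortcut bullets above concrete, here are two hedged sketches; both also illustrate why swapping the loops (hyperparameter loop innermost) pays off, since one fit per inner split can serve many hyperparameter candidates. They are illustrative only, not code from this answer, and all names (`X_train`, `k_max`, `lam`, ...) are placeholders.

First, the PCA point: within one inner-CV split, fit the decomposition once for the largest number of components and read off every smaller model by slicing, instead of refitting for each candidate $k$.

    # Sketch: evaluate all PCA component counts from a single SVD per inner split.
    # Assumes X_train / X_test are numpy arrays from one inner-CV split (placeholders).
    import numpy as np

    def pca_scores_all_k(X_train, X_test, k_max):
        """Fit PCA once (via SVD) and return test-set scores for every k <= k_max."""
        mu = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)  # loadings, variance-sorted
        scores_kmax = (X_test - mu) @ Vt[:k_max].T                   # scores of the largest model
        # The k-component model is just the first k columns -- no refitting needed.
        return {k: scores_kmax[:, :k] for k in range(1, k_max + 1)}

Second, the analytical leave-one-out point: for linear smoothers such as ridge regression with a fixed penalty, the exact LOO residuals follow from a single fit via the hat matrix, $e_i^{LOO} = (y_i - \hat y_i)/(1 - H_{ii})$. A minimal sketch (no intercept; `lam` is the fixed ridge penalty):

    # Sketch: exact leave-one-out error for ridge regression from ONE fit,
    # instead of n separate refits.  Assumes X (n x p) and y (n,) are numpy arrays.
    import numpy as np

    def ridge_loo_rmse(X, y, lam):
        n, p = X.shape
        H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)  # hat matrix: y_hat = H @ y
        loo_residuals = (y - H @ y) / (1.0 - np.diag(H))         # shortcut formula
        return np.sqrt(np.mean(loo_residuals ** 2))

Both tricks belong to the training (inner CV) side only; as stated above, the outer verification loop should still use plain refits without shortcuts.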
cbeleites unhappy with SX