I am reading Cawley and Talbot and came across a post on implementing nested CV (NCV), as well as numerous posts with good answers on the topic (general NCV, training on the full dataset after NCV, how to choose a model after NCV).
I wrote a script that performs NCV on a 'visible' set after splitting the data into 'visible' and 'hidden' parts. The model is then refit on the entire visible set, which is usually where the modeling process ends; here, though, the refit model is also tested on the 'hidden' set, which plays the role of truly unseen data. The error on the hidden set should roughly match the generalization error estimated by NCV on the visible set.
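For concreteness, here is a minimal sketch of that procedure in scikit-learn. It is not my actual script: the estimator (ridge regression), the search grid, and the fold counts are placeholders.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import (GridSearchCV, KFold, cross_val_score,
                                     train_test_split)

X, y = load_diabetes(return_X_y=True)

# Split into a 'visible' set (used for NCV) and a 'hidden' set
# (stands in for truly unseen data).
X_vis, X_hid, y_vis, y_hid = train_test_split(X, y, test_size=0.3,
                                              random_state=0)

param_grid = {"alpha": np.logspace(-3, 3, 13)}  # assumed search space
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

# Inner loop tunes hyperparameters; outer loop estimates generalization error.
tuned = GridSearchCV(Ridge(), param_grid, cv=inner_cv,
                     scoring="neg_mean_squared_error")
ncv_scores = -cross_val_score(tuned, X_vis, y_vis, cv=outer_cv,
                              scoring="neg_mean_squared_error")
print("Est  MSE: %.3f +/- %.3f" % (ncv_scores.mean(), ncv_scores.std()))

# Refit on the whole visible set, then score once on the hidden set.
tuned.fit(X_vis, y_vis)
true_mse = mean_squared_error(y_hid, tuned.predict(X_hid))
print("True MSE: %.3f" % true_mse)
```

The same skeleton should carry over to the classification experiment further down; only the estimator and the scoring change.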
There are many knobs to turn that affect the estimation accuracy: e.g., the number of inner folds, the number of outer folds, the number of iterations (of the full process), or the choice of metric. We can vary them here and see what happens, but what should one do on a real dataset? Can I use this script on a subset of the real dataset to find settings that give accurate estimates, and expect those settings to carry over to unseen data?
Output:

MSE:
  Est  3044.6858062 +/- 69.0466700133
  True 3051.32410389 +/- 427.344654478
MAE:
  Est  39.822042199 +/- 0.621534888251
  True 39.2236773233 +/- 6.39557352798
R2:
  Est  0.479608389929 +/- 0.0169472816234
  True 0.463311879751 +/- 0.113470834353
Update:
The results above are from the diabetes dataset, which has a continuous response variable. I repeated the experiment on the breast cancer dataset, which has a binary response:
pr (area under the PR curve):
  Est  0.986205196083 +/- 0.00241897261673
  True 0.989708379454 +/- 0.0107812698088
ll (negative log loss):
  Est  0.173785240219 +/- 0.00888222771163
  True 0.164129744679 +/- 0.0802308030923
roc (area under the ROC curve):
  Est  0.978565008631 +/- 0.00248749514519
  True 0.984685331718 +/- 0.0161458827103
bri (Brier score):
  Est  0.0526782015835 +/- 0.00296152695846
  True 0.0505493562734 +/- 0.0290771904303
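For reference, the four metrics above map onto built-in scikit-learn scorers. A minimal sketch showing just the scorer names (plain CV rather than the full nested procedure; the pipeline and estimator are placeholders, not my actual script):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Built-in scorer names matching the four metrics reported above.
scoring = {
    "pr": "average_precision",  # average precision, a stand-in for PR AUC
    "ll": "neg_log_loss",       # log loss (sign-flipped by sklearn)
    "roc": "roc_auc",           # area under the ROC curve
    "bri": "neg_brier_score",   # Brier score (sign-flipped by sklearn)
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(clf, X, y, cv=cv, scoring=scoring)
for name in scoring:
    scores = res["test_" + name]
    print("%-4s %.4f +/- %.4f" % (name, scores.mean(), scores.std()))
```

Note that sklearn negates loss-type scorers ("neg_log_loss", "neg_brier_score") so that larger is always better; flip the sign back when reporting the raw losses.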