131

How can one use nested cross validation for model selection?

From what I read online, nested CV works as follows:

  • There is the inner CV loop, where we may conduct a grid search (e.g. running K-fold CV for every candidate model, i.e. every combination of hyperparameters/features)
  • There is the outer CV loop, where we measure the performance of the model that won in the inner loop, on a separate external fold.

At the end of this process we end up with $K$ models ($K$ being the number of folds in the outer loop). These models are the ones that won in the grid search within the inner CV, and they are likely different (e.g. SVMs with different kernels, trained with possibly different features, depending on the grid search).

How do I choose a model from this output? It looks to me that selecting the best model out of those $K$ winning models would not be a fair comparison since each model was trained and tested on different parts of the dataset.

So how can I use nested CV for model selection?

Also, I have read threads discussing how nested cross validation is useful for analyzing the learning procedure. What types of analysis/checks can I do with the scores that I get from the outer K folds?

Amelio Vazquez-Reina

4 Answers

105

How do I choose a model from this [outer cross validation] output?

Short answer: You don't.

Treat the inner cross validation as part of the model fitting procedure. That means that the fitting, including the fitting of the hyper-parameters (this is where the inner cross validation hides), is just like any other model estimation routine.
The outer cross validation estimates the performance of this model fitting approach. For that you use the usual assumptions:

  1. the $k$ outer surrogate models are equivalent to the "real" model built by model.fitting.procedure with all data.
  2. Or, in case 1. breaks down (pessimistic bias of resampling validation), at least the $k$ outer surrogate models are equivalent to each other.
    This allows you to pool (average) the test results. It also means that you do not need to choose among them, as you assume that they are basically the same. The breakdown of this second, weaker assumption is model instability.

Do not pick the seemingly best of the $k$ surrogate models - that would usually be just "harvesting" testing uncertainty and leads to an optimistic bias.

So how can I use nested CV for model selection?

The inner CV does the selection.

It looks to me that selecting the best model out of those K winning models would not be a fair comparison since each model was trained and tested on different parts of the dataset.

You are right that it is not a good idea to pick one of the $k$ surrogate models. But you are wrong about the reason. The real reason: see above. The fact that they are not trained and tested on the same data does not "hurt" here.

  • Not having the same testing data: as you want to claim afterwards that the test results generalize to never seen data, this cannot make a difference.
  • Not having the same training data:
    • if the models are stable, this doesn't make a difference: Stable here means that the model does not change (much) if the training data is "perturbed" by replacing a few cases by other cases.
    • if the models are not stable, three considerations are important:
      1. you can actually measure whether and to what extent this is the case, by using iterated/repeated $k$-fold cross validation. That allows you to compare cross validation results for the same case that were predicted by different models built on slightly differing training data (a small sketch follows this list).
      2. If the models are not stable, the variance observed over the test results of the $k$-fold cross validation increases: you not only have the variance due to the fact that only a finite number of cases is tested in total, but also additional variance due to the instability of the models (variance in the predictive abilities).
      3. If instability is a real problem, you cannot extrapolate well to the performance for the "real" model.
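
Regarding point 1. above, here is a minimal sketch of such a stability check in scikit-learn (the data set and the estimator are placeholders I chose for illustration; in the nested setting the estimator would be the whole model fitting procedure, e.g. a GridSearchCV wrapper):

```python
# Sketch: measure prediction (in)stability via repeated k-fold cross validation.
# Each repeat yields one out-of-fold prediction per case, produced by a model
# trained on slightly different data; disagreement across repeats = instability.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
estimator = make_pipeline(StandardScaler(), SVC(C=1.0))  # placeholder fitting procedure

n_repeats = 10
preds = np.empty((n_repeats, len(y)), dtype=int)
for rep in range(n_repeats):
    cv = KFold(n_splits=5, shuffle=True, random_state=rep)  # different split each repeat
    preds[rep] = cross_val_predict(estimator, X, y, cv=cv)

# per-case agreement with the majority prediction across repeats
agreement = (preds == np.round(preds.mean(axis=0))).mean(axis=0)
print("cases with unstable predictions:", int(np.sum(agreement < 1.0)))
print("mean per-case agreement:", agreement.mean())
```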

Which brings me to your last question:

What types of analysis/checks can I do with the scores that I get from the outer K folds?

  • check for stability of the predictions (use iterated/repeated cross-validation)
  • check for the stability/variation of the optimized hyper-parameters.
    For one thing, wildly scattering hyper-parameters may indicate that the inner optimization didn't work. For another, this may allow you to decide on the hyper-parameters without the costly optimization step in similar situations in the future. By costly I do not mean computational resources, but the fact that this "costs" information that may be better used for estimating the "normal" model parameters. (A sketch of these checks follows this list.)

  • check for the difference between the inner and outer estimate of the chosen model. If there is a large difference (the inner being very overoptimistic), there is a risk that the inner optimization didn't work well because of overfitting.
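
A minimal sketch of these outer-loop checks with scikit-learn (the SVM, the grid, and the data set are placeholders of mine, not part of the original answer):

```python
# Sketch: nested CV where we keep, per outer fold, the winning hyper-parameters,
# the inner CV estimate, and the outer test-fold estimate, so we can check
# hyper-parameter stability and the inner-vs-outer gap.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

records = []
for train_idx, test_idx in outer_cv.split(X, y):
    # the inner CV (hyper-parameter fitting) is part of the model fitting procedure
    search = GridSearchCV(make_pipeline(StandardScaler(), SVC()), param_grid, cv=inner_cv)
    search.fit(X[train_idx], y[train_idx])
    records.append({
        "best_params": search.best_params_,                     # do these scatter wildly?
        "inner_score": search.best_score_,                      # inner (possibly optimistic) estimate
        "outer_score": search.score(X[test_idx], y[test_idx]),  # outer estimate
    })

for r in records:
    print(r["best_params"], "inner=%.3f" % r["inner_score"], "outer=%.3f" % r["outer_score"])

# large positive gaps or wildly varying best_params_ across folds are the warning signs
gap = np.mean([r["inner_score"] for r in records]) - np.mean([r["outer_score"] for r in records])
print("mean inner estimate minus mean outer estimate: %.3f" % gap)
```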


Update for @user99889's question: what to do if the outer CV finds instability?

First of all, detecting in the outer CV loop that the models do not yield stable predictions does not, in that respect, really differ from detecting that the prediction error is too high for the application. It is one of the possible outcomes of model validation (or verification), implying that the model we have is not fit for its purpose.

In the comment answering @davips, I was thinking of tackling the instability in the inner CV - i.e. as part of the model optimization process.

But you are certainly right: if we change our model based on the findings of the outer CV, yet another round of independent testing of the changed model is necessary.
However, instability in the outer CV would also be a sign that the optimization wasn't set up well - so finding instability in the outer CV implies that the inner CV did not penalize instability in the necessary fashion - this would be my main point of critique in such a situation. In other words, why does the optimization allow/lead to heavily overfit models?

However, there is one peculiarity here that IMHO may excuse the further change of the "final" model after careful consideration of the exact circumstances: as we did detect overfitting, any proposed change to the model (fewer d.f., a more restrictive model, or aggregation) would be in the direction of less overfitting (or at least of hyperparameters that are less prone to overfitting). The point of independent testing is to detect overfitting - underfitting can be detected by data that was already used in the training process.

So if we are talking, say, about further reducing the number of latent variables in a PLS model, that would be comparably benign (if the proposed change were a totally different type of model, say PLS instead of SVM, all bets would be off), and I'd be even more relaxed about it if I knew that we are anyway at an intermediate stage of modeling - after all, if the optimized models are still unstable, there is no question that more cases are needed. Also, in many situations, you'll eventually need to perform studies that are designed to properly test various aspects of performance (e.g. generalization to data acquired in the future). Still, I'd insist that the full modeling process would need to be reported, and that the implications of these late changes would need to be carefully discussed.

Also, aggregation, including an out-of-bag analogue CV estimate of performance, would be possible from the already available results - which is the other type of "post-processing" of the model that I'd be willing to consider benign here. Yet again, it would then have been better if the study had been designed from the beginning to check that aggregation provides no advantage over individual predictions (which is another way of saying that the individual models are stable).
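
To make the aggregation idea concrete, here is a rough scikit-learn sketch (the surrogate "model fitting procedure" below is a bare placeholder; in the nested setting each surrogate would include its own inner tuning): keep the $k$ outer surrogate models, pool their out-of-fold predictions as an out-of-bag-like performance estimate, and average their predictions for new cases instead of picking one of them.

```python
# Sketch: aggregate the k outer surrogate models instead of selecting one of them;
# the pooled out-of-fold predictions act as an out-of-bag analogue performance estimate.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

surrogates, oof_pred = [], np.empty_like(y)
for train_idx, test_idx in cv.split(X, y):
    # placeholder surrogate; a real nested setup would tune hyper-parameters here
    model = make_pipeline(StandardScaler(), SVC(probability=True))
    model.fit(X[train_idx], y[train_idx])
    surrogates.append(model)
    oof_pred[test_idx] = model.predict(X[test_idx])

print("out-of-fold ('out-of-bag analogue') accuracy:", (oof_pred == y).mean())

def ensemble_predict(X_new):
    """Average the surrogates' predicted probabilities rather than picking one model."""
    proba = np.mean([m.predict_proba(X_new) for m in surrogates], axis=0)
    return proba.argmax(axis=1)

print("aggregated predictions for the first 5 cases:", ensemble_predict(X[:5]))
```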


Update (2019): the more I think about these situations, the more I come to favor the "nested cross validation apparently without nesting" approach.

cbeleites unhappy with SX
  • W.r.t. model selection, if the classifier is unstable, should we choose the one with the median performance amongst the best ones? This choice would be analogous to your suggestion to compare inner performance with outer performance. – dawid Mar 31 '14 at 14:08
  • 2
    @davips: If the models are unstable, the optimization will not work (instability causes additional variance). Choosing the one model with median (or average) performance will not help, though. Instead, if the models are unstable I'd recommend either to go for more restrictive models (e.g. stronger regularization) or to build a model ensemble (which is fundamentally different from selecting one model). – cbeleites unhappy with SX Mar 31 '14 at 16:11
  • the ensemble approach is almost a bagging. It seems reasonable to not discard the models. Therefore the outer accuracies (aka out-of-bag error) are useless for model selection. – dawid Mar 31 '14 at 20:31
  • Can someone define the $k$ *"outer"* surrogate models and the data on which they are tested? I am sorry but I am quite confused on this point. – ClarPaul Mar 21 '17 at 21:02
  • @cbeleites: if for example we see that a model is unstable based on nested CV, haven't we made that determination based on the test data (since nested CV uses all the data)? In this case, if we change the modeling process by electing to use a more restrictive model, haven't we then fit that choice to the test set? – sjw May 23 '17 at 20:06
  • 2
    @user99889: please see updated answer. – cbeleites unhappy with SX May 24 '17 at 16:10
  • @clarpaul: bit late, but maybe my answer to another question helps https://stats.stackexchange.com/a/233027/4598 – cbeleites unhappy with SX May 24 '17 at 16:14
  • @cbeleites thank you. Re model instability; how about increasing k? If k is increased, the training set becomes larger, which would make the training more stable. – sjw May 25 '17 at 13:45
  • 1
    @user99889: yes - but don't expect miracles there. If stability is an issue when training with 80 % of the cases (k = 5), it will likely still be an issue with k = 10, i.e. training on 90 % of n - only an additional 12.5 % of cases compared to the 80 % / k = 5 surrogate models. – cbeleites unhappy with SX May 26 '17 at 19:18
  • 1
    @cbeleites: a related hypothetical. Suppose I decide to search a parameter space c:[1,2,3]. I perform nested CV on my whole dataset and find the performance not so great. I therefore expand my search space to c:[0.5,1,1.5,2,2.5,3,3.5,4]. Have I done something very bad? It seems that I have essentially changed my parameter space (which is a part of the modeling process) based on knowledge gotten from the test data, and therefore need to evaluate on a dataset external to my current dataset? Happy to make this a separate question if you think it's best. – sjw May 31 '17 at 17:23
  • @user99889: I guess a separate question is better: I started an answer, but it's longer than a comment should be and the answer above is already very long. – cbeleites unhappy with SX Jun 01 '17 at 08:27
  • @cbeleites https://stats.stackexchange.com/questions/282954/if-change-parameter-search-space-after-nested-cv-does-it-introduce-optimistic-b – sjw Jun 02 '17 at 12:57
  • can someone please answer my question: https://stats.stackexchange.com/questions/427227/outer-folds-errors-in-nested-cross-validation – Perl Sep 14 '19 at 16:07
  • This is wrong: You are right in that it is no good idea to pick one of the surrogate models. But you are wrong about the reason. Real reason: see above. The fact that they are not trained and tested on the same data does not "hurt" here. -- Different part of the dataset might lead to very different performance depending on the fortuitous bias of the sampling scheme. – Albert James Teddy May 18 '20 at 13:31
  • @AlbertJamesTeddy: (quotation marks around the quote would have helped a lot). Different performance for different splits can of course happen, that is the training procedure may not yield *stable* models for the data at hand (or more precisely, for subsamples of 1-1/k of the data at hand). But: we can detect such instability (e.g. by using repeated k-fold in the outer loop, or by checking the k model parameter sets yielded by the inner loop) and thus correctly conclude that the model is not fit for purpose. This is no different from the verification detecting that the models yielded by the... – cbeleites unhappy with SX May 18 '20 at 14:07
  • 1
    ... training procedure are not fit for purpose for whatever other reason. What would hurt is if we'd let models go out into production that are not fit for purpose. (It is possible to implement stability checks already in the hyperparameter optimization. But that's a design decision for the training procedure. It is fine to decide against this and accept the increased risk of the result being unstable. IMHO one can take all sorts of shortcuts [apply heuristics] in training, as long as there is an honest and state-of-the-art verification and validation that will show if the decisions were bad.) – cbeleites unhappy with SX May 18 '20 at 14:13
34

In addition to cbeleites' excellent answer (+1), the basic idea is that cross-validation is used to assess the performance of a method for fitting a model, not of the model itself. If you need to perform model selection, then you need to perform that independently in each fold of the cross-validation procedure, as it is an integral part of the model fitting procedure. If you use a cross-validation based model selection procedure, this means you end up with nested cross-validation. It is helpful to consider the purpose of each cross-validation - one is for model selection, the other for performance estimation.

I would make my final model by fitting the model (including model selection) to the whole dataset, after using nested cross-validation to get an idea of the performance I could reasonably expect to get from that model.
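
A compact scikit-learn sketch of this recipe (the SVM and grid below are placeholders I picked, not part of the answer): nested CV for the performance estimate, then the same selection procedure refit on the whole data set to obtain the final model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# model selection (inner CV) wrapped up as part of the fitting procedure
tuner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)

# outer CV: estimates the performance of the whole procedure, not of one model
nested_scores = cross_val_score(tuner, X, y, cv=5)
print("expected performance: %.3f +/- %.3f" % (nested_scores.mean(), nested_scores.std()))

# final model: run the same selection procedure on the whole data set
final_model = tuner.fit(X, y).best_estimator_
```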

Dikran Marsupial
  • 3
    Why do you need to `get an idea of the performance`? – dawid Mar 31 '14 at 14:13
  • 1
    @davips Generally if a statistical method is going to be used for some practical purpose, then the users will often want to have some idea of how well it works (e.g. medical screening test). Also if you are developing a machine learning algorithm then it is useful to have an unbiased estimate of how well it performs compared with competing methods. It is also a useful means of validating whether the method actually works (which is invalidated if the cross-validation is used both to select parameters and estimate performance). – Dikran Marsupial Mar 31 '14 at 16:35
  • 6
    So to actually decide what parameter to use in the final model you would do the inner loop once? So if the inner loop was 10-fold validation, you would hold out 1/10 of the data, train each model, repeat this 10 times, and then pick the parameter value with the smallest average error? Then retrain the model with that parameter value on the whole data set? – emschorsch Oct 04 '15 at 00:31
  • 2
    Yes, that is correct. – Dikran Marsupial Oct 05 '15 at 07:19
  • (+1) I am thinking of re-organizing the ambiguous [nested] tag: see https://stats.meta.stackexchange.com/questions/4306. Do you think we could use a [nested-cross-validation] tag in addition to the existing [cross-validation], or do you think it's not really necessary and we are fine using [cross-validation] alone? (If you have something to say, maybe it's better if you comment over there on Meta; I will erase this off-topic comment after some time. Thanks.) – amoeba Apr 04 '17 at 08:54
  • So am I correct in thinking that when the final model is used to make predictions on 'real' data (in a production setting, rather than benchmark experiments), the inner loop still needs to be run to pick which of the candidate models gets used to predict from? What if I need (1) to pick a good model from a wider model set, (2) to know how well this model will perform on unseen data, and (3) to be able to rapidly predict from new data, without the computational cost of a cross validation - is there some way to meet all my needs? – jay Apr 29 '17 at 03:16
  • @Dikran Marsupial Do you mean that: 1) Running a first CV for model selection (inner loop); 2) Running a second CV, using a different split, for evaluating model performance (outer loop); 3) Fitting the model on the whole dataset to estimate parameters (and their confidence intervals) is a correct procedure? Because this is also what I would like to do. – Federico Tedeschi Jul 06 '17 at 11:24
  • Basing on the paper here: http://file.scirp.org/pdf/OJS_2015080511003128.pdf I would use repeated half-half split for model selection, and leave-one-out CV for model-performance evaluation. – Federico Tedeschi Jul 06 '17 at 12:35
  • 2
    @FedericoTedeschi The cross-validations need to be nested, rather than merely a different split, in order to get an unbiased performance estimator (see section 5.3 of my paper http://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley10a.pdf). Generally I only use LOOCV for model selection for models where it can be computed efficiently, and would use bootstrapping/bagging for small datasets (with the OOB error replacing the outer cross-validation). – Dikran Marsupial Jul 10 '17 at 08:21
  • After doing nested cross-validation, is there any way to quantify uncertainty of predictions on new test data? – Austin Dec 29 '17 at 23:44
  • @DikranMarsupial, given your last paragraph, couldn't one then run nested CV to get idea of the performance of each method under consideration (SVM, trees, etc.) and choose the one with best performance, to then do plain CV on the whole dataset to choose the hyperparameters for such method (and then fit parameters with whole dataset)? Would this introduce overfitting too? If so, would it be less significant than the overfitting from doing plain CV? It is still quite unclear to me how to choose the method (not the hyperparameters). NB: I only have 160 datapoints (~10 predictors; binary classf.) – Coca Mar 25 '20 at 17:07
  • @Coca The nested cross-validation is used to avoid the bias in performance estimate caused by over-fitting the model selection criteria when tuning the hyper-parameters, so it doesn't get rid of the over-fitting, just compensates for it. However, if you use nested cross-validation to choose the model, then the outer-cross-validation is no longer completely unbiased as it has been used to make a choice about the final model. Fortunately this bias is likely to be fairly small, but for small datasets it can cause difficulties e.g. when selecting a kernel. – Dikran Marsupial Apr 03 '20 at 12:15
9

I don't think anyone really answered the first question. By "nested cross-validation" I think he meant combining it with GridSearch. Usually GridSearch has CV built in and takes a parameter on how many folds we wish to test. Combining those two is good practice, I think, but the model from GridSearch and CrossValidation is not your final model. You should pick the best parameters and eventually train a new model with all your data, or even do a CrossValidation here too on unseen data, and then, if the model really is that good, train it on all your data. That is your final model.

anselal
  • 5
    to clarify, in python scikit-learn, `GridSearchCV(refit=True)` does actually refit a model on the FULL data using the best parameters, so that extra step is not necessary. [See docs](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) – Paul May 11 '18 at 22:18
  • You are right about the refit option. I was just stating the obvious!! – anselal May 13 '18 at 11:10
  • "the model from GridSearch is not your final model". But my points is that the grid search model with refit=True **is** the final model. Do you mean you and I are on the same page? But then I still don't see where the nesting happens in Grid search with CV. It seems like a single layer of CV to me (eg, 5-fold CV in grid search is a single layer of CV). – Paul May 14 '18 at 22:51
  • We are on the same page about the refit. But with nested CV we mean that you create another CV loop outside your GridSearch, leaving some data out of the training and testing your final-final model to see if it generalises (makes good predictions on unknown data) – anselal May 16 '18 at 05:27
1

As was already pointed out in cbeleites' answer, the inner and outer CV loops have different purposes: the inner CV loop is used to get the best model, while the outer CV loop can serve different purposes. It can help you estimate, in a less biased way, the generalisation error of your top-performing model. Additionally, it gives you insight into the "stability" of your inner CV loop: are the best-performing hyperparameters consistent across the different outer folds? For this information you pay a high price, because you are repeating the optimization procedure k times (k-fold outer CV). If your goal is only to estimate the generalization performance, I would consider another way, described below.

According to this paper from Bergstra and Bengio: Random Search for Hyper-Parameter Optimization (4000 citations, as of 2019):

Goal: make a hyperoptimization to get the best model and report / get an idea about its generalization error

Your available data is only a small portion of a generally unknown distribution. CV can help by giving you a mean of expectations rather than a single expectation. CV can also help you in choosing the best model (the best hyperparameters). You could skip CV here at the cost of less information (mean of the expectation on different datasets, variance).

At the end you would choose the top-performing model out of your inner loop (for example, random search on hyperparameters with/without CV).

Now you have your "best" model: it is the winner of the hyperoptimization loop.

In practice there will be several different models that perform nearly equally well. When it comes to reporting your test error, you must be careful:

"However, when different trials have nearly optimal validation means, then it is not clear which test score to report, and a slightly different choice of λ [single fixed hyperparameter set] could have yielded a different test error. To resolve the difficulty of choosing a winner, we report a weighted average of all the test set scores, in which each one is weighted by the probability that its particular λ(s) is in fact the best."

For details, see the paper. It involves calculating the test error of each model you evaluated in the hyperoptimization loop. This should be cheaper than a nested CV!
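
A rough numerical sketch of that reporting rule, as I read it (the numbers below are made up for illustration; this is not code from the paper): model each trial's validation score as a Gaussian around its observed mean, estimate by Monte Carlo the probability that each trial is truly the best, and weight the test scores accordingly.

```python
# Sketch: weighted average of test scores, each weighted by the estimated
# probability that its hyperparameter setting (trial) is in fact the best.
import numpy as np

rng = np.random.default_rng(0)

# per-trial results from the hyperoptimization loop (illustrative numbers)
val_mean   = np.array([0.91, 0.905, 0.89, 0.87])   # mean validation score per trial
val_se     = np.array([0.01, 0.012, 0.01, 0.015])  # standard error of that mean
test_score = np.array([0.90, 0.91, 0.88, 0.86])    # test score of each trial's model

n_sim = 100_000
draws = rng.normal(val_mean, val_se, size=(n_sim, len(val_mean)))
p_best = np.bincount(draws.argmax(axis=1), minlength=len(val_mean)) / n_sim

print("P(trial is best):", np.round(p_best, 3))
print("weighted test score: %.4f" % np.dot(p_best, test_score))
```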

So: this technique is an alternative to estimate generalization errors from a model selected out of a hyperoptimization loop!

NB: in practice, most people just do a single hyperoptimization (often with CV) and report the performance on the test set. This can be too optimistic.

kradant