Even belatedly, let me throw in my 2 drachmas. In my opinion you can train a model using KFold cross-validation as long as you do it inside a loop. I am not posting everything, just the juice. Note that the functions below can be used with any model, be it an sklearn predictor/pipeline, a Keras or a PyTorch model, after making the necessary array/tensor conversions in some cases. Here goes:
import numpy as np
from sklearn.metrics import accuracy_score, auc, fbeta_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold

# Get the necessary metrics (in this case for a severely imbalanced dataset)
def get_scores(model, X, y):
    pred = np.round(model.predict(X))      # hard labels; round() also covers Keras-style probability outputs
    probs = model.predict_proba(X)[:, 1]   # positive-class probabilities
    precision, recall, _ = precision_recall_curve(y, probs)
    accu = accuracy_score(y, pred)
    pr_auc = auc(recall, precision)        # area under the precision-recall curve
    f2 = fbeta_score(y, pred, beta=2)      # F2 weights recall higher than precision
    return pred, accu, pr_auc, f2
# Train model with stratified KFold cross-validation
def train_model(model, X, y):
    accu_list, pr_auc_list, f2_list = [], [], []
    # shuffle=True is required for random_state to take effect; `seed` is defined elsewhere
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train, val in kf.split(X, y):
        X_train, y_train = X[train], y[train]
        X_val, y_val = X[val], y[val]
        model.fit(X_train, y_train)
        _, accu, pr_auc, f2 = get_scores(model, X_val, y_val)
        accu_list.append(accu)
        pr_auc_list.append(pr_auc)
        f2_list.append(f2)
    # averaged over the held-out folds, i.e. validation metrics
    print(f'Validation Accuracy: {np.mean(accu_list):.3f}')
    print(f'Validation PR_AUC: {np.mean(pr_auc_list):.3f}')
    print(f'Validation F2: {np.mean(f2_list):.3f}')
    return model
So, after training, when you want to predict unseen data you would just do:
fitted_model = train_model(model, X, y)
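For a quick self-contained check, here is a minimal sketch of the whole flow. The synthetic dataset, the `LogisticRegression` model, and the `seed` value are illustrative choices of mine, not part of the original setup; the helpers are compact versions of the functions above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, auc, fbeta_score, precision_recall_curve
from sklearn.model_selection import StratifiedKFold

seed = 42  # illustrative seed

def get_scores(model, X, y):
    pred = np.round(model.predict(X))
    probs = model.predict_proba(X)[:, 1]
    precision, recall, _ = precision_recall_curve(y, probs)
    return pred, accuracy_score(y, pred), auc(recall, precision), fbeta_score(y, pred, beta=2)

def train_model(model, X, y):
    accu_list, pr_auc_list, f2_list = [], [], []
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train, val in kf.split(X, y):
        model.fit(X[train], y[train])
        _, accu, pr_auc, f2 = get_scores(model, X[val], y[val])
        accu_list.append(accu)
        pr_auc_list.append(pr_auc)
        f2_list.append(f2)
    print(f'Validation Accuracy: {np.mean(accu_list):.3f}')
    return model

# Imbalanced toy dataset: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=seed)
fitted_model = train_model(LogisticRegression(max_iter=1000), X, y)

# Predict on fresh, unseen rows
X_new, _ = make_classification(n_samples=10, weights=[0.95], flip_y=0, random_state=seed + 1)
preds = fitted_model.predict(X_new)
print(preds.shape)  # (10,)
```

Note that, as written, the returned model carries the fit from the last fold only, which is enough for a sanity check of the averaged cross-validation scores.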
I hope that works out for you. In my case it works every time. Now, the computational cost is a different, and very sad, matter.