
I'm performing a study where I select the kernel type and hyperparameters for SVR in an inner CV loop, with an outer loop doing 10-fold CV. The output is 10 trained models and their performance measures.

My question is where to go from here. When I train a new model on the complete dataset with the selected kernel (using either the hyperparameters that gave the minimum error during the 10-fold CV, or re-optimizing them with the selected kernel on the complete dataset), the final model I end up with has not been validated against any held-out data. Is it reasonable to do this and use the average error previously obtained from the 10-fold CV as an "informal" performance estimate, given that the model is trained on a slightly larger dataset? How would I word this in a journal paper? My thesis advisor is questioning it, for one.

amoeba
E.Koz

2 Answers


It sounds like you're taking the correct approach: you'll want to do nested CV so that you tune your parameters in the inner loop and then estimate the error on a holdout fold that the model has never seen before. As an example:

Divide your training set into 10 folds. Use 9 of the folds to tune your model (again through CV), then estimate the error on the 10th fold that you held out. Do this 10 times to get an estimate of the error (you could also repeat the whole procedure with a different random split into folds).
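For illustration, here is a minimal sketch of that nested scheme using scikit-learn's SVR and GridSearchCV; the dataset, kernels, and parameter grid are placeholders, not your actual setup:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

# Placeholder data standing in for the real dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

# Hypothetical search space: kernel type and hyperparameters are tuned together.
param_grid = [
    {"kernel": ["rbf"], "C": [1, 10, 100], "gamma": ["scale", 0.01, 0.1]},
    {"kernel": ["linear"], "C": [1, 10, 100]},
]

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # tunes hyperparameters
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)  # estimates generalization error

search = GridSearchCV(SVR(), param_grid, cv=inner_cv,
                      scoring="neg_mean_squared_error")

# Each outer fold refits the whole inner search on 9 folds and tests on the
# held-out fold, so the reported scores never use data seen during tuning.
outer_scores = cross_val_score(search, X, y, cv=outer_cv,
                               scoring="neg_mean_squared_error")
print(outer_scores.mean(), outer_scores.std())
```

Here cross_val_score treats the entire tuning procedure as the estimator, so each outer test fold is only touched after the inner search has finished.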

The Elements of Statistical Learning explicitly warns against "[using] cross-validation to estimate the unknown tuning parameters *and* to estimate the prediction error of the final model" (emphasis mine; see Ch. 7, Section 7.10.2). You can of course use CV to estimate these separately.

If you need another citation for the importance of this, some researchers at Google recently released a paper related to this. If you have access to the article in Science, they also released their Python code along with it.

Tchotchke

"...train a new model with the complete dataset using the selected kernel (by either the hyperparameters that gave the min error during the 10-fold CV or finding the optimal ones with the selected kernel for the complete dataset)..."

You need to keep the hyperparameters you found in the inner CV: the outer CV can only be used as a surrogate for an independent test of the final model if no further optimization or selection whatsoever takes place afterwards.

For reporting in the paper, as well as for your own information, I suggest looking at:

  • the difference between the estimates from the inner vs. the outer CV. This gives an indication of whether there are problems with overfitting in the hyperparameter optimization.
  • the variation within the models of the inner (optimization) CV: is the minimum meaningful with respect to the testing uncertainty due to the finite number of test cases?
  • How stable are the selected hyperparameters across the 10 folds of the outer CV? (A minimal sketch for checking this is shown after the list.)
  • How stable is the performance across the surrogate models of the outer cross-validation? You can draw conclusions about this only if you either have enough test cases in each fold or do iterated/repeated cross-validation for the outer loop.
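As a rough sketch of how the first and third checks could be done with scikit-learn (the data and parameter grid below are placeholders, not the original setup): run the outer loop by hand, record the selected hyperparameters together with the inner and outer error estimates for each fold, and compare them.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

# Placeholder data standing in for the real dataset.
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

param_grid = {"kernel": ["rbf", "linear"], "C": [1, 10, 100]}  # hypothetical grid
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)

for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X)):
    # Inner CV: hyperparameter optimization on the 9 training folds only.
    search = GridSearchCV(SVR(), param_grid,
                          cv=KFold(n_splits=5, shuffle=True, random_state=1),
                          scoring="neg_mean_squared_error")
    search.fit(X[train_idx], y[train_idx])

    inner_mse = -search.best_score_  # inner-CV estimate for the selected setting
    outer_mse = mean_squared_error(y[test_idx], search.predict(X[test_idx]))

    # Stable best_params_ and small inner/outer gaps are reassuring; large gaps
    # hint at overfitting in the hyperparameter optimization.
    print(f"fold {fold}: params={search.best_params_}, "
          f"inner MSE={inner_mse:.2f}, outer MSE={outer_mse:.2f}")
```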
cbeleites unhappy with SX