Is it Valid to Grid Search Cross Validation for Model Hyperparameter Selection then a separate Cross Validation for Generalisation Error?

Question

The question has to do with Model Selection and Evaluation

I'm trying to wrap my head around the scale of how different nested cross validation would be from the following:

Let's say I am attempting to evaluate how suitable a model class is for a particular problem domain.

Let's assume for hypothetical purposes nested cross validation is not possible.
I have a small random dataset for a particular domain that warrants using a grid-search cross validation to do hyperparameter selection rather than some other hyperparameter selection approach (AIC etc.). So I run a Grid Search Cross Validation as a way to find optimal hyperparameters (i.e. the optimal complexity/ flexibility) for this model class on this domain. I let the program run.
A few minutes later I get a fresh new similarly sized random sample from the same domain, a potential test set for the model. But while similarly sized, it is still small, likely risking a high variance for the generalisation error it puts out if it is run as a single test set.
Thus, I was wondering, would it be valid that I take the selected hyperparameters from step 2 (a procedure which is meant to find the likely hyperparameters for the best complexity/ flexibility to minimise error for that model class on that particular domain) and run a new cross-validation on the fresh sample from step 3 as an estimate of generalisation error given the small test set?

My thinking is that if the cross validation selection step is meant to find the optimal complexity for that model class[1] [2], can't I just use those hyperparameters on a fresh cross validation to find generalisation error?

At the moment I feel the flaws in my thinking are:

A. That not using the new test set in the second step biases the results to over-estimate generalisation error compared to nested cross-validation.

B. And also because the datasets being used are small, further effort in bootstrapping and using repeated cross validation could improve the standard errors of the generalisation error estimate.

Thank you for your time.

[1] James et al 2013 An Introduction to Statistical Learning P.183

"We find in Figure 5.6 that despite the fact that they sometimes underestimate the true test MSE, all of the CV curves come close to identifying the correct level of flexibility—"

[2] James et al 2013 An Introduction to Statistical Learning P.186

"Though the cross-validation error curve slightly underestimates the test error rate, it takes on a minimum very close to the best value for K."

Why would cross validation on the new dataset provide you any better an estimate than just using the new dataset as a test set? If you are worried that the dataset is small and might not generalize well doing cross validation isn't going to make that situation any different. Building k new models is only going to increase the bias and variance in your estimate. — astel, Oct 14 '20 at 06:38

cbeleites unhappy with SX · Accepted Answer · 2020-10-15T09:35:37.677

First of all, yes you can get a valid cross validated performance estimate from a 2nd data set where you train with fixed hyperparameters.

However, consider the statistical properties:

the variance of the performance estimate due to finite sample sample size is the same whether you test a model m1 trained on data set 1 on n2 cases of data set 2 vs. testing surrogate models m2i trained on k-1 folds of data set 2: the total number of tested cases is the same.
The cross validation estimate will be slightly pessimistically biased compared to the final model m2 which is trained on data set 2 (with the hyperparameters taken from training on data set 1).
The cross validation estimate also contains some variance due to model instabiliy, i.e. differences in true performance across the k2 surrogate models.
This is relevant for the uncertainty of the cross validation estimate since that is taken as approximation for another model m2. But if model m1 is tested on data set 2, instability is not relevant since you actually test the final model without the additional approximation step.

(This is what also @astel's answer says, just in different words in case that helps)

All in all, your 2nd cross validation is not invalid in the sense that you don't do anything that violates independence. But outside the corner case that model m2 is truly better than model m1, your procedure is no improvement over using data set 2 as independent test set.

That is, the following corner case: data sets 1 and 2 are sufficiently similar that you can reasonably assume the optimal hyperparameters to be the same across the data sets, but you expect the data sets to be sufficiently different that you'd want to retrain the model parameters on data set 2.

Since you can obtain a second set, you may want to make full use of the advantages a truly separate test set can have over train/test splits (single split or cross validation or any other resampling validation). See e.g. Hold-out validation vs. cross-validation and maybe also Is hold-out validation a better approximation of "getting new data" than k-fold CV?.

In case you need an estimate of model instability in addition to the independent test set performance on data set 2, you can also get that at very low computational effort: predict data set 2 also with the k surrogate models you have from data set 1 (m1i) and look at the variance across the predictions for each case. This is actually more efficient than even repeated cross validation since you have a fully crossed design case x surrogate model.

All this assumes that data sets 1 and 2 are of similar size: if the sizes vary substantially, you couldn't assume the same hyperparameters to work well.

Thank you for the detailed response sir, you've given me a lot of fruit for thought. Stay safe! — Benjamin Phua, Oct 15 '20 at 12:14

astel · Answer 2 · 2020-10-15T14:45:31.100

0

Let's say you have the same number of records 'n1' and 'n2' in each of these two data sets. You perform k-fold cross validation on the first data set and select optimal hyper-parameters.

The traditional way (in your scenario) would be to select the second dataset as a test set and now you are testing n2 records on a model trained on n1 records.

What you are suggesting will lead you to testing n2 records on a model that is trained on n1*(k-1)/k records since you want to do cross validation on the test set. This will lead to a pessimistic bias in your estimate since you are training on less records. It will also lead to more variance in your estimate since now you are adding randomness due to splitting your data in the test set.

Okay, rereading all of your comments I think I understand what you are trying to do. The reason you want to perform cross-validation on your test set rather than simply using the test set to estimate error is because you also want to use this test set to determine model complexity (i.e. you want to do feature selection on your test set). This will be problematic as your estimate is going to be optimistically biased then since the entirety of your test set is used for feature selection and thus it is part of the training process and you have no final test set with which to evaluate your error rate.

I understand you don't want to do nested cross-validation for time reasons but you could simply combine your old and new data and do a single train/test split (i.e. a single outer loop) and do cross validation on the training set to find hyper-parameters and model complexity (feature selection) at the same time. Finding optimal features is essentially hyper-parameter selection after all.

If you want to separate hyper-parameter selection and model complexity as in your current method you are going to need a third data set with which to estimate error (or simply split your 2nd data set into train/test before performing cross validation).

edited Oct 15 '20 at 14:45

answered Oct 14 '20 at 06:50

astel

1,388
5
17

Hi Astel, thank you for your response. Sorry I am still wrapping my head around this as I want to avoid nested cross-validation due to time constraints. Let me try to express my thoughts once more. According to [1] and [2] above, the gridsearch CV approach may overestimate generalisation error but it tends to pick the hyperparameters that set the right level of complexity/ flexibility for this model class to perform at its best for this data domain. – Benjamin Phua Oct 14 '20 at 06:56
continued: Thus instead of using the first set of records to fit the model, rather it is used only to decide to level of flexibility to use in that model class, and then the fit itself is done on the second set of data based on that level of complexity resulting in the expected generalisation. Again this would appear to be wasting the first set of data just for setting the complexity but I am wondering if it some merit. – Benjamin Phua Oct 14 '20 at 06:57
But if you already found optimal hyper-parameters in dataset 1 what is the benefit of doing cross validation on dataset 2 if you aren't trying to find hyper-parameters? – astel Oct 14 '20 at 07:00
Thanks again for your reply Astel, the goal of the experiment is to evaluate a particular learning algorithm on a problem domain rather than find hyper-parameters. So the generalisation error is meant to show in a relatively statistically robust way how well this type of model class (e.g. Radial SVM) would work on a particular problem domain (e.g. Detecting a particulalr genetic defect) rather than doing hyperparameter selection. – Benjamin Phua Oct 14 '20 at 08:28
As such the role of the hyperparameter selection in only to optimise the level of complexity in the model for this problem domain (e.g. What value of gamma and cost to use that will minimise general error). – Benjamin Phua Oct 14 '20 at 08:29
Yes I understand how cross validation works and the role of hyper-parameter selection the problem is I don't think you do. Why do you think performing cross-validation on what is effectively your test set will give you a better estimate of your generalization error? (It won't it will give a worse one). – astel Oct 14 '20 at 16:50
Hi Astel, thanks again, and fully agree I could be completely wrong. For the test set which was originally smaller, we were able to get a fair amount more data almost to the same size as the original training set. So we could re-run everything with nested cross validation as we now have enough data to do so but we don't have the time. So I figured that we could justify using the original training set for hyperparameter optimisation and using the second dataset only for getting a better estimate for generalisation error. – Benjamin Phua Oct 15 '20 at 07:53
Specifically, instead of training on gridsearch CV on D1 and testing on entire D2, we use D1 to find the "best" complexity for each model class under investigation and because D2 is now about the same size as D1, we then use the fixed hyperparameters from the D1 experiment and re-run cross validation on D2 which would re-fit the model at the "optimal" complexity and get an average generalisation error across D2. – Benjamin Phua Oct 15 '20 at 07:56
And thus, (sorry for the length), where testing on the entire D2 could lead to a high variance in generalisation error (because the dataset maybe too small to be representative of the population). The cross validated average error could be argued to have less variance. – Benjamin Phua Oct 15 '20 at 07:57
as per James et al 2013 An Intro to Statistical Learning P. 178 – Benjamin Phua Oct 15 '20 at 08:09
Thanks for having this discourse with me Astel, I appreciate your thoughts. – Benjamin Phua Oct 15 '20 at 08:54
See my edit in my answer – astel Oct 15 '20 at 14:35

Is it Valid to Grid Search Cross Validation for Model Hyperparameter Selection then a separate Cross Validation for Generalisation Error?

2 Answers2