This question concerns model selection and evaluation.
I'm trying to wrap my head around how different nested cross-validation would be from the following procedure:
Let's say I am attempting to evaluate how suitable a model class is for a particular problem domain.
Let's assume, for the sake of argument, that nested cross-validation is not possible.
1. I have a small random dataset for a particular domain, small enough that grid-search cross-validation is warranted for hyperparameter selection rather than some other approach (AIC, etc.).
2. I run a grid-search cross-validation to find the optimal hyperparameters (i.e. the optimal complexity/flexibility) for this model class on this domain, and let the program run.
3. A few minutes later I receive a fresh, similarly sized random sample from the same domain, a potential test set for the model. But while similarly sized, it is still small, so using it as a single held-out test set would likely give a high-variance estimate of the generalisation error.
I was therefore wondering: would it be valid to take the hyperparameters selected in step 2 (a procedure meant to find the complexity/flexibility that minimises error for that model class on that particular domain) and run a new cross-validation with them on the fresh sample from step 3, as an estimate of generalisation error, given that the test set is small?
My thinking is that if the cross-validation selection step is meant to find the optimal complexity for that model class [1][2], can't I just reuse those hyperparameters in a fresh cross-validation to estimate generalisation error? (A sketch of what I mean is below.)
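To make this concrete, here is a minimal sketch of the procedure in scikit-learn. The SVC estimator, the parameter grid, and the synthetic data are just placeholders standing in for my actual setup:

```python
# Minimal sketch of the proposed procedure (placeholder estimator, grid,
# and data -- not my actual setup).
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Stand-in for the two small random samples from the same domain.
X, y = make_classification(n_samples=200, random_state=0)
X1, y1 = X[:100], y[:100]   # first small sample (step 1)
X2, y2 = X[100:], y[100:]   # fresh, similarly sized sample (step 3)

# Step 2: grid-search cross-validation on the first sample to select
# hyperparameters (i.e. the complexity/flexibility) for this model class.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]},
    cv=5,
)
grid.fit(X1, y1)

# Proposed evaluation: fix those hyperparameters and run a *new*
# cross-validation on the fresh sample to estimate generalisation error,
# rather than using the fresh sample as a single held-out test set.
tuned = clone(grid.best_estimator_)           # same hyperparameters, unfitted
scores = cross_val_score(tuned, X2, y2, cv=5)
print(grid.best_params_, scores.mean(), scores.std())
```

The point of the clone is that only the selected hyperparameters carry over to the second sample; nothing fitted on the first sample is reused.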
At the moment I feel the flaws in my thinking are:
A. That not using the new sample purely as a held-out test set, but instead cross-validating on it, biases the results towards over-estimating generalisation error compared to nested cross-validation.
B. That, because the datasets involved are small, further effort such as bootstrapping or repeated cross-validation could improve the standard error of the generalisation error estimate (see the sketch after this list).
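For B, this is the kind of repeated cross-validation I have in mind, again only a sketch, reusing the placeholder `tuned` estimator and fresh sample `X2, y2` from the snippet above:

```python
# Sketch of repeated cross-validation on the fresh sample (placeholder
# objects `tuned`, `X2`, `y2` from the snippet above).
from sklearn.model_selection import RepeatedKFold, cross_val_score

rkf = RepeatedKFold(n_splits=5, n_repeats=20, random_state=0)
rep_scores = cross_val_score(tuned, X2, y2, cv=rkf)

# Summarise the spread of fold scores across repeats. Note that this spread
# understates the true uncertainty, since the fold scores share training data
# and are therefore not independent.
print(rep_scores.mean(), rep_scores.std())
```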
Thank you for your time.
[1] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, p. 183:
"We find in Figure 5.6 that despite the fact that they sometimes underestimate the true test MSE, all of the CV curves come close to identifying the correct level of flexibility—"
[2] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, p. 186:
"Though the cross-validation error curve slightly underestimates the test error rate, it takes on a minimum very close to the best value for K."