
Assume we want to use k-fold cross-validation to get an estimate of the expected prediction error. Let's assume we use an SVM with fixed hyperparameters. From my understanding, each training fold poses a different optimization problem, so the algorithm ends up with different support vectors and a different decision boundary in every iteration. If we don't have a test set, we would be done here.
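
Here is a minimal sketch of what I mean, assuming scikit-learn; the dataset and the hyperparameter values are just illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters are fixed; each of the k fits solves its own
# optimization problem on a different training fold.
model = SVC(kernel="linear", C=1.0)
scores = cross_val_score(model, X, y, cv=10)

# Mean over the folds as the estimate of the expected prediction error (here: accuracy)
print(scores.mean(), scores.std())
```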

When we use nested CV, assuming our outer "nest" simply consists of a train/test split, we would similarly use k-fold CV in the inner loop, get an estimate of the prediction error, and then apply the model that we already preselected to the hold-out set.

In our example with the nested CV, which decision boundary are we using for the test set? I assume we take the best model and refit it on the entire inner training set.
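
To make my assumption concrete, a sketch of this train/test-split "outer nest", again assuming scikit-learn and an illustrative fixed SVM:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = SVC(kernel="linear", C=1.0)

# Inner 10-fold CV on the training part gives the prediction-error estimate.
inner_scores = cross_val_score(model, X_train, y_train, cv=10)

# Refit on the entire inner training set; this single decision boundary
# is the one applied to the hold-out test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(inner_scores.mean(), test_score)
```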

What would happen if, instead of using a pre-determined model, we used GridSearchCV? During CV, we would iterate over all hyperparameter combinations and select the one with the lowest estimated prediction error. If we were to implement this as a nested loop, would this be the way to do it (a code sketch follows the list):

  1. Outer loop: 10-fold CV
  2. Take the training set of an outer fold and run 10-fold CV on it
  3. On this training set, find the model (hyperparameter combination) with the lowest estimated prediction error
  4. Fit that model on the entire outer training set
  5. Use the outer test fold to get an estimate of the generalization error
  6. Repeat from step 2 for the next outer fold
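
This is roughly how I picture it in code, assuming scikit-learn's GridSearchCV for the inner loop; the dataset and parameter grid are only illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10, 20]}

outer_cv = KFold(n_splits=10, shuffle=True, random_state=0)
outer_scores, chosen_params = [], []

for train_idx, test_idx in outer_cv.split(X):                    # step 1: outer 10-fold CV
    X_tr, y_tr = X[train_idx], y[train_idx]                      # step 2: outer training set
    search = GridSearchCV(SVC(), param_grid, cv=10)              # steps 2-3: inner 10-fold grid search
    search.fit(X_tr, y_tr)                                       # step 4: refit=True fits the winner on the full outer training set
    outer_scores.append(search.score(X[test_idx], y[test_idx]))  # step 5: estimate on the outer test fold
    chosen_params.append(search.best_params_)                    # the "best" model may differ per outer fold
    # step 6: the loop continues with the next outer split

print(np.mean(outer_scores))
print(chosen_params)
```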

The reason I am asking is that I read a paper in which CV was defined such that, after CV is done, you run the best model again on your original (entire) training set to get an estimate of the test error. They then said that this leads to an overestimation of the performance. This seemed so obvious that I was not sure what to make of it, and it left me confused about how people use CV in practice.

About the nested cross-validation: it could happen that I select a different "best" model in every iteration of the outer loop. For example:

Loop 1) SVM + linear kernel + C = 10
Loop 2) SVM + linear kernel + C = 20
...

In the end, I would get 10 estimates from possibly 10 different models. How can I interpret those results?
