While there are many answers that explain the reasoning behind the train/validation/test split, they are usually concerned with a single learning method. So there are two parts to this question:
Part 1: Suppose I am only interested in a specific algorithm e.g. SVM:
As far as I understand, the process is the following (a rough code sketch follows the list):
- Train many specifications (different hyperparameter settings) of the SVM on the training set
- Apply them all to the validation set to choose the best specification SVM* (i.e. tuning)
- Apply the best-performing specification, SVM*, to the test set to approximate its generalization error E(SVM*)
- No more tuning is allowed after having seen the test error
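To make the workflow concrete, here is a minimal sketch of what I have in mind, using scikit-learn for concreteness; the dataset, split sizes and hyperparameter grid are just placeholders, not part of any real experiment:

```python
# Sketch of the Part 1 workflow: tune on the validation set, evaluate SVM* once on the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)  # placeholder data

# Fixed train/validation/test split (60/20/20 here, chosen arbitrarily)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Train many specifications on the training set, pick the best one on the validation set
param_grid = [{"C": C, "gamma": g} for C in (0.1, 1, 10) for g in ("scale", 0.01)]
best_params, best_val_score = None, -float("inf")
for params in param_grid:
    model = SVC(**params).fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_params, best_val_score = params, val_score

# Refit the chosen specification SVM* and estimate its generalization error once
svm_star = SVC(**best_params).fit(X_train, y_train)
test_score = svm_star.score(X_test, y_test)  # no further tuning after this point
```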
The only question here is: what if I discover additional SVM hyperparameters that I have not tuned yet, or want to extend the search space for existing ones? Is it cheating to repeat the validation/tuning step with the new hyperparameter space, select the new best specification SVM**, and apply it again to the test set to approximate the new error? As before, the best specification is chosen on the validation set, not the test set.
The question is related to the discussion in the comment section of this answer: https://stats.stackexchange.com/a/153058/307304
Part 2: Suppose I want to experiment with multiple learning methods e.g. SVM, RF and KNN
Here, I see there are two decision problems:
- Model tuning: As in Part 1, find the best parameter specification for each method, giving (SVM*, RF*, KNN*)
- Method selection: Given the best specification of each method, decide which method is best (i.e. SVM vs RF vs KNN)
I think I understand model tuning, which is always performed on the validation set. But what about method selection? I see two options (sketched in code after the list):
- val-select: Select the best method according to validation error. Then only the best-performing method on the validation set, e.g. RF*, is applied to the test set to approximate its generalization error E(RF*)
- test-select: Apply the best specification of each method to the test set to approximate the generalization errors of all methods, E(RF*), E(SVM*), E(KNN*). Then report the method with the lowest test error as the best one.
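In code, the two options differ only in which set is used to pick the winner. This sketch reuses the splits and imports from the sketch above; the "tuned" specifications here are placeholders standing in for the real output of the model-tuning step:

```python
# val-select vs test-select, given the tuned specifications SVM*, RF*, KNN*
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Placeholder "best" specifications; in practice these come from tuning on the validation set
tuned = {
    "SVM*": SVC(C=1).fit(X_train, y_train),
    "RF*": RandomForestClassifier(random_state=0).fit(X_train, y_train),
    "KNN*": KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train),
}

# val-select: pick the method on the validation set, then touch the test set only once
val_scores = {name: m.score(X_val, y_val) for name, m in tuned.items()}
winner = max(val_scores, key=val_scores.get)
print("val-select:", winner, "test error:", 1 - tuned[winner].score(X_test, y_test))

# test-select: evaluate every tuned method on the test set and report the best one
test_errors = {name: 1 - m.score(X_test, y_test) for name, m in tuned.items()}
print("test-select:", min(test_errors, key=test_errors.get), test_errors)
```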
Which is the correct approach, and should test-select be considered cheating? As far as I can tell, test-select is effectively what academic papers do when they provide tables comparing algorithms. If tuning on the test set is not acceptable, why is it acceptable to do method selection on the test set, or at least to report test-set results for multiple methods, in academic papers?