
While there are many answers that explain the reasoning behind the train/validation/test split, they are usually concerned with a single learning method. So there are two parts to this question:

Part 1: Suppose I am only interested in a specific algorithm, e.g. SVM.

As far as I understand, the process is the following (a rough code sketch appears after the list):

  • Train many specifications (different hyperparameters) of the SVM on the training set
  • Apply them all to the validation set to choose the best specification SVM* (i.e. tuning)
  • Apply the best-performing specification, SVM*, to the test set to approximate the generalization error E(SVM*)
  • No more tuning is allowed after having seen the test error
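
To make these steps concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the toy dataset, the 60/20/20 split, and the small C/gamma grid are illustrative assumptions, not part of the question.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy dataset; in practice X, y come from your problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 60/20/20 train/validation/test split (split sizes are an arbitrary choice).
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 1-2: train many specifications on the training set, score them on the
# validation set, and keep the best one (SVM*).
best_score, best_model = -1.0, None
for C, gamma in product([0.1, 1.0, 10.0], ["scale", 0.01, 0.1]):
    model = SVC(C=C, gamma=gamma).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_model = score, model

# Step 3: touch the test set exactly once, to estimate the generalization
# error E(SVM*) of the chosen specification. No further tuning after this.
test_error = 1.0 - best_model.score(X_test, y_test)
print(f"SVM* validation accuracy: {best_score:.3f}, estimated test error: {test_error:.3f}")
```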

The only question here is: what if I discover that there are additional SVM hyperparameters that I have not tuned yet, or if I want to extend the search space of existing hyperparameters? Is it cheating if I do the following: repeat the validation/tuning step with the new hyperparameter space, select the new best specification SVM**, and apply it again to the test set to approximate the new error? As always, the best specification is chosen on the validation set and not on the test set.

This question is related to the discussion in the comment section of this answer: https://stats.stackexchange.com/a/153058/307304

Part 2: Suppose I want to experiment with multiple learning methods, e.g. SVM, RF and KNN.

Here, I see there are two decision problems:

  • Model tuning: As in Part 1, find the best hyperparameter specification for each method (SVM*, RF*, KNN*)
  • Method selection: Given the best specification of each method, decide on the best method (i.e. SVM vs RF vs KNN)

I suppose that I understand model tuning, which is always performed on the validation set. But what about method selection? I see two options:

  • val-select: Select the best method according to the validation error. Then only the method with the best validation error, e.g. RF*, is applied to the test set in order to approximate its generalization error E(RF*). (This option is sketched in code right after this list.)
  • test-select: Apply the best specification of each method to the test set, in order to approximate the generalization error of all methods: E(RF*), E(SVM*), E(KNN*). Then we assume/report that the best-performing method is the one with the lowest generalization error.
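
For what it is worth, here is a minimal sketch of the val-select option in Python, again assuming scikit-learn; the per-method candidate grids are illustrative assumptions. Both model tuning and method selection use only the validation set, and the test set is touched exactly once, by the overall winner.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Candidate specifications per method (illustrative grids).
candidates = {
    "SVM": [SVC(C=C) for C in (0.1, 1.0, 10.0)],
    "RF":  [RandomForestClassifier(n_estimators=n, random_state=0) for n in (50, 200)],
    "KNN": [KNeighborsClassifier(n_neighbors=k) for k in (3, 5, 15)],
}

# Model tuning: best specification per method (SVM*, RF*, KNN*) by validation score.
best_per_method = {}
for name, models in candidates.items():
    scored = [(m.fit(X_train, y_train).score(X_val, y_val), m) for m in models]
    best_per_method[name] = max(scored, key=lambda pair: pair[0])

# Method selection, also on the validation set: only the winner sees the test set.
winner_name, (winner_val_score, winner_model) = max(
    best_per_method.items(), key=lambda item: item[1][0]
)
test_error = 1.0 - winner_model.score(X_test, y_test)
print(f"selected method: {winner_name} (validation accuracy {winner_val_score:.3f}), "
      f"estimated generalization error: {test_error:.3f}")
```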

What is the correct way, and should test-select be considered cheating? As far as I know, test-select is often used in academic papers when they provide tables comparing algorithms. If model tuning on the test set is not acceptable, why is it acceptable to do method selection on the test set, or at least to report test-set results for multiple methods in academic papers?

  • The point of the final test data is that you have not seen it when choosing and tuning and training your model. If you are using the test data to choose your final model, or if you get ideas for improving your model after having seen the test data, then it no longer performs this role, and becomes a form of validation data instead. You would need new unseen test data to perform a test on your new final model, and that may not be available. – Henry Mar 10 '21 at 11:43
  • Great. But can you use a single test dataset to compare the generalization error of different methods? – Enk9456 Mar 10 '21 at 12:09
  • If you are using the set to choose between methods then I would call it a validation set not a test set. – Henry Mar 10 '21 at 13:02
