
I want to choose a model class (e.g. logistic regression vs. random forests), but the validation set is already used for selecting hyperparameters. Should I set aside a second validation set to select the model class?

My idea:

  • Training set: choose parameters
  • Validation set: choose hyperparameters
  • Second validation set: choose model class (e.g. logistic regression vs. random forests)
  • Test set: test model on unseen data

Or should I treat the model class the same way as a hyperparameter and select it based on validation-set performance?

Furthermore, in practice we implement the validation set via cross-validation. Should I use "nested" cross-validation to select the model class, i.e. a CV within a CV?
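Concretely, here is a rough sketch of what I have in mind, assuming a scikit-learn setup in Python (the estimators and grids are just placeholders): an inner CV chooses the hyperparameters and an outer CV scores each model class.

```python
# Hypothetical sketch of nested CV: the inner loop picks hyperparameters,
# the outer loop estimates out-of-sample performance for each model class.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

candidates = {
    "logistic regression": GridSearchCV(
        LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5
    ),
    "random forests": GridSearchCV(
        RandomForestClassifier(random_state=0), {"max_depth": [2, 5, None]}, cv=5
    ),
}

# Outer CV: each GridSearchCV re-tunes its hyperparameters on the inner folds,
# so the outer score is not biased by the tuning.
for name, search in candidates.items():
    scores = cross_val_score(search, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```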

2 Answers


Before the revival of deep learning in the last few years, hyperparameter tuning was simply called model selection. The purpose of the validation set is to choose among several candidate models, and it shouldn't make a difference whether those candidates share the same architecture with different hyperparameters or are completely different architectures.

So no, you shouldn't need a second validation set.
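For example, here is a minimal sketch of this (assuming a scikit-learn setup; the candidate pool is arbitrary): the same validation procedure chooses both the model class and its hyperparameters, and the test set is reserved for the single winner.

```python
# Sketch: model class and hyperparameters are selected by the same validation
# procedure (5-fold CV on the training data), so no second validation set is
# needed; the held-out test set is used exactly once, by the winning model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One flat pool of candidates: different classes, different hyperparameters.
candidates = [
    LogisticRegression(C=0.1, max_iter=1000),
    LogisticRegression(C=1.0, max_iter=1000),
    RandomForestClassifier(max_depth=3, random_state=0),
    RandomForestClassifier(max_depth=None, random_state=0),
]

# Validation performance picks the best candidate, whatever its class.
best = max(candidates, key=lambda m: cross_val_score(m, X_train, y_train, cv=5).mean())

best.fit(X_train, y_train)
print(best, accuracy_score(y_test, best.predict(X_test)))
```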

tddevlin
  • +1, exactly what I thought; I would just use the training data set for hyperparameters... maybe we are just using the term "hyperparameters" for different things :-) See my answer. – Tomas Nov 08 '19 at 19:54

I am actually doing this right now too! :) I have three model classes: logistic regression, random forest, and GP.

My design is this (with 5-fold cross-validation):

  • training data set - optimize parameters and hyperparameters (not sure if we have the same definition of hyperparameters; in my case these are the length-scales of the GP covariance matrix).

  • validation data set - cross-validate models and compare them within & between classes using common test statistics

I suppose this should be perfectly OK; if you have any ideas why this could be a problem, let us discuss it.
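For illustration, here is a rough scikit-learn sketch of the design above (my actual implementation optimizes the GP length-scales with L-BFGS via optim() in R; in this sketch the GP fit does the analogous thing by maximizing the marginal likelihood on the training folds):

```python
# Sketch: GP kernel hyperparameters (the length-scales) are fitted by maximizing
# the log marginal likelihood with L-BFGS-B on the training folds; the three
# model classes are then compared on the held-out folds of a 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
    # fit() maximizes the marginal likelihood over the RBF length-scale
    # (scikit-learn's default optimizer is L-BFGS-B).
    "GP": GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # compare on held-out folds
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```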

Tomas
  • I guess one clarifying point is when you said "choose hyperparameters" in the training set. To be clear, within a training set you should _set_ your hyperparameters and choose parameters via MSE or whatever. Do this over all model-class/hyperparameter combinations. Then, for each combo, measure out-of-sample performance on the validation set. Finally, my idea for comparing model classes is to select the model within each class with the _best out-of-sample performance_ (i.e. hyperparams gave best validation performance). Then compare these class-best models between classes. What do you think? – Robert Hatem Nov 08 '19 at 20:05
  • *" To be clear, within a training set you should set your hyperparameters and choose parameters"* - I am still not understanding what you mean by this. What is the difference between *set* and *choose*? What I am doing actually is to optimize these (hyper)parameters with L-BFGS optim(), which I guess is perfectly OK with the training set. – Tomas Nov 08 '19 at 20:09
  • ... i.e., it's not clear to me why I would use the validation dataset for optimizing hyperparameters with L-BFGS. – Tomas Nov 08 '19 at 20:11
  • Sorry, I was unclear. You optimize parameters on a training set, and optimize hyperparameters using some validation set. You cannot (should not?) optimize both on one set. If you're using CV _within_ the "training" set to optimize hyperparameters, then you're good. – Robert Hatem Nov 08 '19 at 20:13
  • @RobertHatem you are mixing things up totally now :-) 1) *"You optimize parameters on a training set, and optimize hyperparameters using some validation set."* no! As I said above, I am using the training set for optimizing both! :-) 2) *"You cannot (should not?) optimize both on one set."* why? 3) *"If you're using CV within the "training" set to optimize hyperparameters"* no. I am using CV with the validation set for validating (second point in my answer), please check my answer. – Tomas Nov 08 '19 at 20:19
  • @RobertHatem don't know if I was unclear, updating my answer a bit – Tomas Nov 08 '19 at 20:22
  • 2) "You cannot (should not?) optimize both on one set" _Why_? The highest-capacity models will always have the best training set performance. You'll choose the highest-capacity logistic regression and highest-capacity random forests, etc. Instead, you want a logistic regression (or random forest, etc.) that _generalizes_ well, not one with the highest capacity. See the 2nd paragraph of this answer: [Is using both training and test sets for hyperparameter tuning overfitting?](https://stats.stackexchange.com/a/366883/221100). – Robert Hatem Nov 08 '19 at 20:58
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/100834/discussion-between-robert-hatem-and-curious). – Robert Hatem Nov 08 '19 at 21:05