
I know that a train-validation-test split divides the data into three sets, as sketched below:

  • a training dataset - obviously my "in-sample" data
  • a validation dataset
  • a test dataset - obviously my "out-of-sample" data
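For concreteness, here is a minimal sketch of such a three-way split, assuming scikit-learn's `train_test_split` applied twice; the 60/20/20 ratio and the dummy data are just illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data; shapes are illustrative.
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First split off the test set (20% of the data).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then split the remainder into train (60%) and validation (20%);
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```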

My question is: Should I refer to the validation dataset as in-sample or out-of-sample data?

If we're using the validation dataset to fine-tune the parameter values, then the model has seen this data before. So I'm thinking it is "in-sample" data. Am I right?

Thanks for your help!

Kitty Kenty.


1 Answer


Generally, splits are done like this:

a) Train

b) Test

The training data is then typically split into $n$ parts: $n-1$ of them are used for training and the remaining one is used for validation. This process is repeated until each of the $n$ parts has served as the validation set exactly once.
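A minimal sketch of this rotation, assuming scikit-learn's `KFold`; $n=5$ and the dummy arrays are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # dummy features
y = np.arange(10)                 # dummy targets

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # n-1 parts train the model; the remaining part validates it.
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on {len(val_idx)}")
```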

So, yes, the validation data is your in-sample data.

naive
  • What you are describing is *k-fold cross-validation*, and in that context as well as in regular one-time validation the validation set is *out-of-sample* data because for each fold, the model wasn't trained on it, only tested. A good in-depth description can be found here: https://machinelearningmastery.com/k-fold-cross-validation/. In the context of *hyperparameter tuning*, however, you can argue that it is *in-sample* data because *you* have seen it and possibly tuned the model to overfit it. This is why we need a third set for testing the final model. – runcoderun May 22 '19 at 19:21
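To make the comment's last point concrete, here is a minimal sketch of tuning a hyperparameter on the validation set and then evaluating the chosen model once on the held-out test set; it assumes scikit-learn, and the model, grid, and synthetic data are all illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, split 60/20/20 into train/validation/test.
X, y = make_classification(n_samples=300, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # hyperparameter grid (illustrative)
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # the validation set guides this choice...
    if score > best_score:
        best_C, best_score = C, score

# ...so only the untouched test set gives an unbiased final estimate.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
print(f"best C={best_C}, test accuracy={final.score(X_test, y_test):.3f}")
```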