
I know that a train-validation-test split divides the data into three sets, as sketched below:

  • a training dataset - obviously my "in-sample" data
  • a validation dataset
  • a test dataset - obviously my "out-of-sample" data
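For concreteness, here is a minimal sketch of such a three-way split, assuming scikit-learn's `train_test_split` applied twice; the 60/20/20 ratio and the dummy data are just illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data; shapes are illustrative.
X, y = np.arange(100).reshape(50, 2), np.arange(50)

# First split off the test set (20% of the data).
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then split the remainder into train (60%) and validation (20%);
# 0.25 of the remaining 80% equals 20% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```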

My question is: Should I refer to the validation dataset as in-sample or out-of-sample data?

If we're using the validation dataset to fine-tune the parameter values, then the model has seen this data before. So I'm thinking it is "in-sample" data. Am I right?

Thanks for your help!

Kitty Kenty.


1 Answer


Generally, splits are done like this:

a) Train

b) Test

The training data is then typically split into $n$ parts: $n-1$ of them are used for training and the remaining one is used for validation. This process is repeated until each of the $n$ parts has served as the validation set exactly once.
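A minimal sketch of this rotation, assuming scikit-learn's `KFold`; $n=5$ and the dummy arrays are illustrative:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # dummy features
y = np.arange(10)                 # dummy targets

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # n-1 parts train the model; the remaining part validates it.
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on {len(val_idx)}")
```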

So, yes, the validation data is your in-sample data.

naive
  • What you are describing is *k-fold cross-validation*, and in that context as well as in regular one-time validation the validation set is *out-of-sample* data because for each fold, the model wasn't trained on it, only tested. A good in-depth description can be found here: https://machinelearningmastery.com/k-fold-cross-validation/. In the context of *hyperparameter tuning*, however, you can argue that it is *in-sample* data because *you* have seen it and possibly tuned the model to overfit it. This is why we need a third set for testing the final model. – runcoderun May 22 '19 at 19:21
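To make the comment's last point concrete, here is a minimal sketch of tuning a hyperparameter on the validation set and then evaluating the chosen model once on the held-out test set; it assumes scikit-learn, and the model, grid, and synthetic data are all illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, split 60/20/20 into train/validation/test.
X, y = make_classification(n_samples=300, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # hyperparameter grid (illustrative)
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # the validation set guides this choice...
    if score > best_score:
        best_C, best_score = C, score

# ...so only the untouched test set gives an unbiased final estimate.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
print(f"best C={best_C}, test accuracy={final.score(X_test, y_test):.3f}")
```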