In statistics, we divide the data into two sets: an in-sample set and an out-of-sample set. In machine learning, the data is divided into three sets: a training set, a testing set and a validation set.

From my understanding, the in-sample set is equivalent to the training set, and the out-of-sample set is equivalent to the testing set. Is my understanding correct?

But what about the validation set? What does it correspond to?

kjetil b halvorsen
NN2
    "Is my understanding correct?" Yes. Practices differ, though, and it's not necessarily a statistics-vs.-machine-learning difference. – rolando2 Apr 06 '18 at 15:19

1 Answer

I regard the validation set as part of the training data. You probably need to make various decisions when building your model, and you can inform them by fitting several candidate models on the training data with the validation set left out, then seeing how well each performs against the validation set. In a sense this process could be described as tuning the model or choosing optimal hyperparameters.
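As a minimal sketch of that idea, assuming a toy one-dimensional data set and a k-nearest-neighbours regressor (both are illustrative choices of mine, as are the names `split_three_ways`, `knn_predict` and the candidate values of `k`), the validation set picks the hyperparameter and the test set is touched only once at the end:

```python
import math
import random

def split_three_ways(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle, then cut into fitting, validation and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    fit = rows[n_test + n_val:]
    return fit, val, test

def knn_predict(fit_rows, x, k):
    """Average the responses of the k fitting points nearest to x."""
    nearest = sorted(fit_rows, key=lambda r: abs(r[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(fit_rows, eval_rows, k):
    """Mean squared error on eval_rows of a k-NN model built from fit_rows."""
    errors = [(knn_predict(fit_rows, x, k) - y) ** 2 for x, y in eval_rows]
    return sum(errors) / len(errors)

# Hypothetical data: a noisy sine curve.
rng = random.Random(1)
data = [(x, math.sin(x) + rng.gauss(0, 0.3))
        for x in (rng.uniform(0, 6) for _ in range(200))]

fit, val, test = split_three_ways(data)

# Tune k using only the fitting and validation parts of the training data.
candidates = [1, 3, 5, 10, 25, 50]
val_errors = {k: mse(fit, val, k) for k in candidates}
best_k = min(val_errors, key=val_errors.get)

# Rebuild on the whole training data (fit + validation), then look at the test set once.
test_error = mse(fit + val, test, best_k)
print(best_k, val_errors[best_k], test_error)
```

The split fractions and the model are arbitrary here; the point is only that `test` plays no part in choosing `best_k`.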

You can go further than this and divide the training data into several validation sets (folds in the jargon), and see how each model performs on each fold when built from the rest of the training data; this repeated process is called cross-validation. Once you have settled on the key decisions for your model, you can finally build it using the whole training set.
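A rough sketch of cross-validation in plain Python follows. The model is a one-parameter ridge regression through the origin, chosen only because it fits in two lines; the data, the helper names (`k_fold_indices`, `fit_ridge_slope`, `cv_error`) and the grid of penalties are all assumptions for illustration:

```python
import random

def k_fold_indices(n, n_folds, seed=0):
    """Shuffle the indices 0..n-1 and deal them into n_folds disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def fit_ridge_slope(rows, lam):
    """Ridge estimate of b in y = b * x (no intercept) with penalty lam."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + lam)

def cv_error(rows, lam, n_folds=5):
    """Average validation MSE over the folds for one value of lam."""
    fold_errors = []
    for held_out in k_fold_indices(len(rows), n_folds):
        held = set(held_out)
        fit_rows = [r for i, r in enumerate(rows) if i not in held]
        val_rows = [rows[i] for i in held_out]
        b = fit_ridge_slope(fit_rows, lam)
        fold_mse = sum((b * x - y) ** 2 for x, y in val_rows) / len(val_rows)
        fold_errors.append(fold_mse)
    return sum(fold_errors) / len(fold_errors)

# Hypothetical training data: y = 2x plus noise.
rng = random.Random(2)
training = [(x, 2.0 * x + rng.gauss(0, 1.0))
            for x in (rng.uniform(-3, 3) for _ in range(60))]

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cv_error(training, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)

# With the decision made, rebuild the model on the whole training set.
final_slope = fit_ridge_slope(training, best_lam)
print(best_lam, final_slope)
```

Every training point is used for validation exactly once, which is why this is usually less noisy than a single hold-out split of the same data.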

Typically your aim is to reduce the risk of overfitting, as you want sensible results when finally comparing with your test data. So in a sense you have used your in-sample data to simulate out-of-sample prediction by using the validation data.

Sometimes in-sample is taken to mean interpolation, while out-of-sample is taken to mean extrapolation, with extrapolation prone to larger errors. If so, it can make sense to choose extreme parts of your in-sample data to use for validation, as this may highlight issues with certain methods, such as polynomial regression failing hopelessly outside the range where it has been fitted.
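To illustrate the last point, here is a sketch that validates the same polynomial models in two ways: on points interleaved with the fitting data (interpolation) and on the largest x values only (extrapolation). The noisy sine data, the cut-off at the 100th sorted point and the helper names are assumptions of mine, and the least-squares fit uses a small hand-written solver only so that the example needs nothing beyond the standard library:

```python
import math
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_polynomial(rows, degree):
    """Least-squares polynomial via the normal equations; returns a predictor."""
    xs = [x for x, _ in rows]
    mid = (max(xs) + min(xs)) / 2
    half = (max(xs) - min(xs)) / 2
    # Rescale x to [-1, 1] over the fitting range to keep the system well conditioned.
    scaled = [((x - mid) / half, y) for x, y in rows]
    a = [[sum(u ** (i + j) for u, _ in scaled) for j in range(degree + 1)]
         for i in range(degree + 1)]
    b = [sum(y * u ** i for u, y in scaled) for i in range(degree + 1)]
    coefs = solve_linear(a, b)
    return lambda x: sum(c * ((x - mid) / half) ** i for i, c in enumerate(coefs))

def mse(predict, rows):
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical in-sample data, sorted by x.
rng = random.Random(3)
data = sorted((x, math.sin(x) + rng.gauss(0, 0.1))
              for x in (rng.uniform(0, 6) for _ in range(150)))

# Interleaved validation: every third point, spread across the whole range.
val_inter = data[1::3]
fit_inter = [r for i, r in enumerate(data) if i % 3 != 1]

# Extreme validation: the 50 points with the largest x.
fit_ext, val_ext = data[:100], data[100:]

results = {}
for degree in (1, 3, 5):
    results[degree] = (mse(fit_polynomial(fit_inter, degree), val_inter),
                       mse(fit_polynomial(fit_ext, degree), val_ext))
    print(degree, results[degree])
```

In this setup the degree-5 fit should look good on the interleaved points and much worse on the extreme ones, which is the kind of failure an interior-only validation set would hide.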

Henry
    You make some good points, but I find most people use the terminology differently, as seen in various threads on this site and others. Most, I think, would say the validation set is the one held out the longest; it is not part of the training data. – rolando2 Apr 06 '18 at 15:16
    @rolando2 - I believe most people would say the test set should be held out longer than the validation set - see for example the answers to https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set – Henry Apr 06 '18 at 15:52