In the limit of infinitely many training samples, what you propose (not splitting into three sets) is valid.
But imagine a regression problem with pairs $(x, y)$, where $x \in [0, 10]$ and the underlying relation is $y = 2x$. Let us exaggerate for the sake of argument and say you have only 3 training data points: $\{(1,2), (9,18), (5,10)\}$. Now you train a network on this regression problem until the training error is zero, which a very simple network can achieve. But the network learns an essentially arbitrary mapping for the values that are not in the training set, so when you feed it a new test point, it can return a nonsense value.
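To make this concrete, here is a minimal numpy sketch. The bump term in `f_overfit` is my own construction (a trained network would not literally produce it), but it shows that zero training error on three points pins down nothing about the values in between:

```python
import numpy as np

# Three training points lying exactly on y = 2x.
x_train = np.array([1.0, 5.0, 9.0])
y_train = 2 * x_train

def f_true(x):
    return 2 * x  # the underlying relation

def f_overfit(x):
    # Adds a bump that vanishes exactly at x = 1, 5, 9,
    # so the training error is still zero.
    return 2 * x + 5 * np.sin(np.pi * (x - 1) / 4)

print(np.max(np.abs(f_overfit(x_train) - y_train)))  # ~0 (up to float round-off)
x_test = np.array([3.0, 7.0])
print(f_true(x_test))     # [ 6. 14.]
print(f_overfit(x_test))  # [11.  9.] -- nonsense away from the training set
```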
Now, the condition in the question is that you have a training set which perfectly reflects the characteristics of your data. However, unless you have infinitely many points in the interval $[0, 10]$, there will always be gaps in the regression function that the network learns, and hence you will always face an over-fitting situation.
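The same construction works for any finite training set, not just for 3 points. In the sketch below (again an artificial worst case, not a trained model; the 20-point set and the $10^{-3}$ scale are arbitrary choices of mine), the polynomial bump vanishes at every training input, so the training error is exactly zero while the predictions at the gap midpoints are wildly wrong:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, size=20))  # 20 points: many, but finite

def f_overfit(x):
    # Product of (x - xi) vanishes at every training input xi.
    bump = np.prod([x - xi for xi in x_train], axis=0)
    return 2 * x + 1e-3 * bump

print(np.max(np.abs(f_overfit(x_train) - 2 * x_train)))  # 0.0 by construction
x_gap = (x_train[:-1] + x_train[1:]) / 2                 # midpoints of the gaps
print(np.max(np.abs(f_overfit(x_gap) - 2 * x_gap)))      # huge errors in the gaps
```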
As whuber also stated in his comment, another effect is that of noise in the data. That means that for every point in your training data you would additionally need infinitely many points, one for each possible value of, say, Gaussian noise, so that the network can learn to ignore the noise as well. Otherwise, you get the well-known over-fitting.
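Here is a sketch of that noise effect, with a degree-9 polynomial standing in for an over-parameterized network (the noise scale and point count are arbitrary choices of mine): with as many parameters as training points, the model interpolates the noise exactly, and its test error is far worse than the noise level.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x_train = np.linspace(0, 10, 10)
y_train = 2 * x_train + rng.normal(scale=1.0, size=x_train.size)  # y = 2x + noise

# Degree 9 with 10 points: enough parameters to interpolate the noise exactly.
fit = Polynomial.fit(x_train, y_train, deg=9)

print(np.max(np.abs(fit(x_train) - y_train)))   # ~0: the noise was memorized
x_test = np.linspace(0.5, 9.5, 10)
print(np.max(np.abs(fit(x_test) - 2 * x_test))) # typically >> the noise std of 1
```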