It's funny that the most upvoted answer doesn't really answer the question :) so I thought it would be nice to back this up with a bit more theory - mostly taken from "Data Mining: Practical Machine Learning Tools and Techniques" and Tom Mitchell's "Machine Learning".
Introduction.
So we have a classifier and a limited dataset, and some of the data must go into the training set while the rest is used for testing (and, if necessary, a third subset is used for validation).
The dilemma we face is this: to build a good classifier, the "training subset" should be as big as possible, but to get a good error estimate, the "test subset" should be as big as possible - and both subsets are drawn from the same pool.
It's obvious that the training set should be bigger than the test set - that is, the split should not be 1:1 (the main goal is to train, not to test) - but it's not clear where exactly the split should be.
Holdout procedure.
The procedure of splitting the "superset" into subsets is called the holdout method. Note that you can easily get unlucky: examples of a certain class could be missing (or overrepresented) in one of the subsets. This can be addressed via
- stratified sampling, i.e. sampling so that each class is properly represented in all data subsets - the procedure is called stratified holdout (see the sketch below)
- stratified sampling with a repeated training-testing-validation process on top of it - which is called repeated stratified holdout
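Here is a minimal sketch of a stratified holdout split, assuming scikit-learn is available (the iris dataset is just a stand-in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions roughly equal in both subsets,
# which is exactly what stratified holdout is about.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # roughly a 2/3 : 1/3 split
```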
In a single (non-repeated) holdout procedure, you might consider swapping the roles of the testing and training data and averaging the two results, but this is only plausible with a 1:1 split between training and test sets, which is not acceptable (see Introduction). Still, the idea is useful, and an improved method (called cross-validation) is used instead - see below!
Cross-validation.
In cross-validation, you decide on a fixed number of folds (partitions of the data). If we use three folds, the data is split into three equal partitions and
- we use 2/3 for training and 1/3 for testing
- and repeat the procedure three times so that, in the end, every instance has been used exactly once for testing (see the sketch right after this list).
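As a sketch of the mechanics (assuming scikit-learn, with a toy dataset and classifier as placeholders), stratified threefold cross-validation looks like this - and you can verify that every instance ends up in the test fold exactly once:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

times_tested = np.zeros(len(y), dtype=int)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

print(fold_scores)          # one accuracy score per fold
print(set(times_tested))    # {1}: every instance was tested exactly once
```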
This is called threefold cross-validation, and if stratification is adopted as well (which is often the case) it is called stratified threefold cross-validation.
But, lo and behold, the standard way is not the 2/3:1/3 split. Quoting "Data Mining: Practical Machine Learning Tools and Techniques",
The standard way [...] is to use stratified 10-fold cross-validation. The data is divided randomly into 10 parts in which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme trained on the remaining nine-tenths; then its error rate is calculated on the holdout set. Thus the learning procedure is executed a total of 10 times on different training sets (each of which have a lot in common). Finally, the 10 error estimates are averaged to yield an overall error estimate.
Why 10? Because "..Extensive tests on numerous datasets, with different learning techniques, have shown that 10 is about the right number of folds to get the best estimate of error, and there is also some theoretical evidence that backs this up..". I haven't found which extensive tests and theoretical evidence they meant, but this one seems like a good start for digging further - if you wish.
They basically just say
Although these arguments are by no means conclusive, and debate continues to
rage in machine learning and data mining circles about what is the best scheme
for evaluation, 10-fold cross-validation has become the standard method in
practical terms. [...] Moreover, there is nothing magic about the exact number
10: 5-fold or 20-fold cross-validation is likely to be almost as good.
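In code, the standard recipe from the quote boils down to something like the following sketch (again assuming scikit-learn, with a toy dataset and classifier standing in for yours):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified 10-fold cross-validation: ten accuracy scores, one per held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print(scores.mean())       # the averaged accuracy estimate
print(1 - scores.mean())   # ...or, equivalently, the overall error estimate
```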
Bootstrap, and - finally! - the answer to the original question.
But we still haven't arrived at the answer to the original question: why is the 2/3:1/3 split so often recommended? My take is that it's inherited from the bootstrap method.
It's based on sampling with replacement. Previously, each instance from the "grand set" went into exactly one of the subsets. Bootstrapping is different: because we sample with replacement, the same instance can easily be picked for the training set more than once.
Let's look into one particular scenario where we take a dataset D1 of n instances and sample it n times with replacement, to get another dataset D2 of n instances.
Now watch closely.
Because some elements in D2 will (almost certainly) be repeated, there must be some instances in the original dataset that have not been picked: we will use these as test instances.
What is the chance that a particular instance wasn't picked for D2? The probability of being picked on any single draw is 1/n, so the probability of not being picked on that draw is (1 - 1/n).
Since the n draws are independent, we multiply these probabilities: (1 - 1/n)^n, which approaches e^-1 ≈ 0.368 as n grows - roughly 1/3. This means our test set will contain about 1/3 of the instances, and the training set (counting distinct instances) about 2/3.
I guess this is the reason why the 1/3:2/3 split is often recommended: the ratio is inherited from the bootstrap estimation method.
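You can check this numerically - a small sketch using nothing beyond NumPy, with n picked arbitrarily - and both the closed-form (1 - 1/n)^n and an actual resampling experiment land near e^-1 ≈ 0.368:

```python
import math
import numpy as np

n = 1000
print((1 - 1/n) ** n)      # ~0.368, close to...
print(math.exp(-1))        # ...e^-1 = 0.3678...

# Empirical check: draw n indices with replacement (that's D2) and measure
# the fraction of instances that were never picked - our test set.
rng = np.random.default_rng(42)
picked = rng.integers(0, n, size=n)
out_of_bag = n - len(np.unique(picked))
print(out_of_bag / n)      # again roughly 1/3 of the data
```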
Wrapping it up.
I want to finish off with a quote from the data mining book (which I cannot prove but assume correct), where they generally recommend preferring 10-fold cross-validation:
The bootstrap procedure may be the best way of estimating error for very small datasets. However, like leave-one-out cross-validation, it has disadvantages that can be illustrated by considering a special, artificial situation [...] a completely random dataset with two classes. The true error rate is 50% for any prediction rule. But a scheme that memorized the training set would give a perfect resubstitution score of 100%, so that e_training instances = 0, and the 0.632 bootstrap will mix this in with a weight of 0.368 to give an overall error rate of only 31.6% (0.632 × 50% + 0.368 × 0%), which is misleadingly optimistic.
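Just to spell out the arithmetic in that quote (a tiny sketch; the error values are the ones from the book's artificial example):

```python
# 0.632 bootstrap: mix the error on the test (out-of-bag) instances with
# the resubstitution error on the training instances.
err_test = 0.50    # true error rate of any rule on the random two-class data
err_train = 0.00   # the memorizing scheme's resubstitution error
err_bootstrap = 0.632 * err_test + 0.368 * err_train
print(err_bootstrap)   # 0.316 -> the misleadingly optimistic 31.6%
```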