It's funny that the most upvoted answer doesn't really answer the question :) so I thought it would be nice to back this up with a bit more theory - mostly taken from "Data Mining: Practical Machine Learning Tools and Techniques" and Tom Mitchell's "Machine Learning".
Introduction.
So we have a classifier and a limited dataset, and some of the data must go into the training set while the rest is used for testing (and, if necessary, a third subset is used for validation).
The dilemma we face is this: to build a good classifier, the "training subset" should be as big as possible, but to get a good error estimate, the "test subset" should be as big as possible - and both subsets are drawn from the same pool.
It's obvious that the training set should be bigger than the test set - that is, the split should not be 1:1 (the main goal is to train, not to test) - but it's not clear where exactly the split should be.
Holdout procedure.
The procedure of splitting the "superset" into subsets is called the holdout method. Note that you can easily get unlucky: examples of a certain class could be missing (or overrepresented) in one of the subsets. This can be addressed via
- stratified sampling, i.e. sampling so that each class is properly represented in all data subsets - the procedure is called stratified holdout (see the sketch below)
- stratified sampling with a repeated training-testing-validation process on top of it - which is called repeated stratified holdout
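Here is a minimal sketch of a stratified holdout split, assuming scikit-learn is available (the iris dataset is just a stand-in for your own data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions roughly equal in both subsets,
# which is exactly what stratified holdout is about.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=42)

print(len(X_train), len(X_test))  # roughly a 2/3 : 1/3 split
```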
In a single (non-repeated) holdout procedure, you might consider swapping the roles of the testing and training data and averaging the two results, but this is only plausible with a 1:1 split between training and test sets, which is not acceptable (see Introduction). Still, the idea is useful, and an improved method (called cross-validation) is used instead - see below!
Cross-validation.
In cross-validation, you decide on a fixed number of folds (partitions of the data). If we use three folds, the data is split into three equal partitions and
- we use 2/3 for training and 1/3 for testing
- and repeat the procedure three times so that, in the end, every instance has been used exactly once for testing (see the sketch right after this list).
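As a sketch of the mechanics (assuming scikit-learn, with a toy dataset and classifier as placeholders), stratified threefold cross-validation looks like this - and you can verify that every instance ends up in the test fold exactly once:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

times_tested = np.zeros(len(y), dtype=int)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

print(fold_scores)          # one accuracy score per fold
print(set(times_tested))    # {1}: every instance was tested exactly once
```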
This is called threefold cross-validation, and if stratification is adopted as well (which is often the case) it is called stratified threefold cross-validation.
But, lo and behold, the standard way is not the 2/3:1/3 split. Quoting "Data Mining: Practical Machine Learning Tools and Techniques",
The standard way [...] is to use stratified 10-fold cross-validation. The data is divided randomly into 10 parts in which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme trained on the remaining nine-tenths; then its error rate is calculated on the holdout set. Thus the learning procedure is executed a total of 10 times on different training sets (each of which have a lot in common). Finally, the 10 error estimates are averaged to yield an overall error estimate.
Why 10? Because "..Extensive tests on numerous datasets, with different learning techniques, have shown that 10 is about the right number of folds to get the best estimate of error, and there is also some theoretical evidence that backs this up..". I haven't found which extensive tests and theoretical evidence they meant, but this one seems like a good start for digging further - if you wish.
They basically just say
Although these arguments are by no means conclusive, and debate continues to
rage in machine learning and data mining circles about what is the best scheme
for evaluation, 10-fold cross-validation has become the standard method in
practical terms. [...] Moreover, there is nothing magic about the exact number
10: 5-fold or 20-fold cross-validation is likely to be almost as good.
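In code, the standard recipe from the quote boils down to something like the following sketch (again assuming scikit-learn, with a toy dataset and classifier standing in for yours):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified 10-fold cross-validation: ten accuracy scores, one per held-out fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=cv)

print(scores.mean())       # the averaged accuracy estimate
print(1 - scores.mean())   # ...or, equivalently, the overall error estimate
```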
Bootstrap, and - finally! - the answer to the original question.
But we still haven't arrived at the answer to the original question: why is the 2/3:1/3 split so often recommended? My take is that it's inherited from the bootstrap method.
It's based on sampling with replacement. Previously, each instance from the "grand set" went into exactly one of the subsets. Bootstrapping is different: because we sample with replacement, the same instance can easily be picked for the training set more than once.
Let's look into one particular scenario where we take a dataset D1 of n instances and sample it n times with replacement, to get another dataset D2 of n instances.
Now watch closely.
Because some elements in D2 will (almost certainly) be repeated, there must be some instances in the original dataset that have not been picked: we will use these as test instances.
What is the chance that a particular instance wasn't picked for D2? The probability of being picked on any single draw is 1/n, so the probability of not being picked on that draw is (1 - 1/n).
Since the n draws are independent, we multiply these probabilities: (1 - 1/n)^n, which approaches e^-1 ≈ 0.368 as n grows - roughly 1/3. This means our test set will contain about 1/3 of the instances, and the training set (counting distinct instances) about 2/3.
I guess this is the reason why the 1/3:2/3 split is often recommended: the ratio is inherited from the bootstrap estimation method.
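You can check this numerically - a small sketch using nothing beyond NumPy, with n picked arbitrarily - and both the closed-form (1 - 1/n)^n and an actual resampling experiment land near e^-1 ≈ 0.368:

```python
import math
import numpy as np

n = 1000
print((1 - 1/n) ** n)      # ~0.368, close to...
print(math.exp(-1))        # ...e^-1 = 0.3678...

# Empirical check: draw n indices with replacement (that's D2) and measure
# the fraction of instances that were never picked - our test set.
rng = np.random.default_rng(42)
picked = rng.integers(0, n, size=n)
out_of_bag = n - len(np.unique(picked))
print(out_of_bag / n)      # again roughly 1/3 of the data
```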
Wrapping it up.
I want to finish off with a quote from the data mining book (which I cannot prove but assume correct), where they generally recommend preferring 10-fold cross-validation:
The bootstrap procedure may be the best way of estimating error for very small datasets. However, like leave-one-out cross-validation, it has disadvantages that can be illustrated by considering a special, artificial situation [...] a completely random dataset with two classes. The true error rate is 50% for any prediction rule. But a scheme that memorized the training set would give a perfect resubstitution score of 100%, so that e_training instances = 0, and the 0.632 bootstrap will mix this in with a weight of 0.368 to give an overall error rate of only 31.6% (0.632 × 50% + 0.368 × 0%), which is misleadingly optimistic.
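Just to spell out the arithmetic in that quote (a tiny sketch; the error values are the ones from the book's artificial example):

```python
# 0.632 bootstrap: mix the error on the test (out-of-bag) instances with
# the resubstitution error on the training instances.
err_test = 0.50    # true error rate of any rule on the random two-class data
err_train = 0.00   # the memorizing scheme's resubstitution error
err_bootstrap = 0.632 * err_test + 0.368 * err_train
print(err_bootstrap)   # 0.316 -> the misleadingly optimistic 31.6%
```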