
Assume I have a small sample size, e.g. N=100, and two classes. How should I choose the training, cross-validation, and test set sizes for machine learning?

I would intuitively pick

  • Training set size as 50,
  • Cross-validation set size as 25, and
  • Test set size as 25.

But this is more or less a guess. How should I really decide these values? And may I try different options (though I suppose that is not preferable, since it increases the risk of overfitting)?

What if I had more than two classes?

    100 is too small for me. I would opt for a leave-one-out strategy for both cross-validation and test evaluation. – Memming Sep 01 '14 at 18:55
    I haven't seen any literature on this (minimum sample sizes for validation). Not sure why. Seems like an important issue. – charles Sep 01 '14 at 20:17
  • There is new theoretical research on this topic, see https://arxiv.org/abs/2112.05977 – user343460 Dec 15 '21 at 15:12
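The leave-one-out strategy suggested in the comments can be sketched in plain Python. The toy 1-nearest-neighbour classifier and the synthetic 1-D two-class data below are assumptions made only for illustration; substitute your own model and features:

```python
import random

# Hypothetical toy data: N=100 one-dimensional points, two classes
# (an assumption for this sketch; any features/classifier would do).
random.seed(0)
data = [(random.gauss(0.0, 1.0), 0) for _ in range(50)] + \
       [(random.gauss(1.5, 1.0), 1) for _ in range(50)]

def predict_1nn(train, x):
    """Classify x by the label of its nearest training point."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

# Leave-one-out: train on N-1 points, test on the held-out one, N times.
correct = 0
for i in range(len(data)):
    held_out = data[i]
    train = data[:i] + data[i + 1:]
    if predict_1nn(train, held_out[0]) == held_out[1]:
        correct += 1

accuracy = correct / len(data)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Every sample is used for evaluation exactly once, which is why this strategy is attractive when N is as small as 100; the cost is fitting the model N times.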

2 Answers

cbeleites unhappy with SX
  • +1 purely for the advice on parameter optimization and model complexity. But all of this advice is fantastic. – charles Sep 03 '14 at 15:46

Given that your sample size is small, a good practice would be to leave out the cross-validation set and use a 60/40 or 70/30 train/test split.

As you can see in section 2.8 of Introduction to Clementine and Data Mining, and also in the MSDN Library's Data Mining - Training and Testing Sets article, a 70/30 split is common. According to Andrew Ng's Machine Learning lectures, a 60/20/20 split is recommended.

Hope this helps. Best regards.
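The 60/20/20 split recommended above can be sketched as follows. Plain indices stand in for the 100 samples here (an assumption for brevity); in practice a stratified split that preserves each class's proportion in all three sets would be preferable:

```python
import random

# Shuffle 100 sample indices, then carve out 60/20/20.
random.seed(0)
indices = list(range(100))
random.shuffle(indices)

train = indices[:60]         # 60% for fitting the model
validation = indices[60:80]  # 20% for model selection / tuning
test = indices[80:]          # 20% held out for the final estimate

print(len(train), len(validation), len(test))  # 60 20 20
```

Shuffling before slicing ensures the three sets are disjoint random subsets rather than contiguous blocks of the original ordering.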

mrdatamx