
Assume I have a small sample size, e.g. N=100, and two classes. How should I choose the training, cross-validation, and test set sizes for machine learning?

I would intuitively pick

  • Training set size as 50,
  • Cross-validation set size as 25, and
  • Test set size as 25.

But this is more or less a guess. How should I really decide these values? And may I try different options (though I suppose that is not preferable, since it increases the risk of overfitting)?

What if I had more than two classes?

    100 is too small for me. I would opt for a leave-one-out strategy for both cross-validation and test evaluation. – Memming Sep 01 '14 at 18:55
    I haven't seen any literature on this (minimum sample sizes for validation). Not sure why. Seems like an important issue. – charles Sep 01 '14 at 20:17
  • There is new theoretical research on this topic, see https://arxiv.org/abs/2112.05977 – user343460 Dec 15 '21 at 15:12
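The leave-one-out strategy suggested in the comments can be sketched in plain Python. The toy 1-nearest-neighbour classifier and the synthetic 1-D two-class data below are assumptions made only for illustration; substitute your own model and features:

```python
import random

# Hypothetical toy data: N=100 one-dimensional points, two classes
# (an assumption for this sketch; any features/classifier would do).
random.seed(0)
data = [(random.gauss(0.0, 1.0), 0) for _ in range(50)] + \
       [(random.gauss(1.5, 1.0), 1) for _ in range(50)]

def predict_1nn(train, x):
    """Classify x by the label of its nearest training point."""
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

# Leave-one-out: train on N-1 points, test on the held-out one, N times.
correct = 0
for i in range(len(data)):
    held_out = data[i]
    train = data[:i] + data[i + 1:]
    if predict_1nn(train, held_out[0]) == held_out[1]:
        correct += 1

accuracy = correct / len(data)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Every sample is used for evaluation exactly once, which is why this strategy is attractive when N is as small as 100; the cost is fitting the model N times.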

2 Answers

cbeleites unhappy with SX
  • +1 purely for the advice on parameter optimization and model complexity. But all of this advice is fantastic. – charles Sep 03 '14 at 15:46

Given that your sample size is small, a good practice would be to leave out the cross-validation set and use a 60/40 or 70/30 train/test split.

As you can see in section 2.8 of Introduction to Clementine and Data Mining, and also in the MSDN Library's Data Mining - Training and Testing Sets article, a 70/30 split is common. According to Andrew Ng's Machine Learning lectures, a 60/20/20 split is recommended.

Hope this helps. Best regards.
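The 60/20/20 split recommended above can be sketched as follows. Plain indices stand in for the 100 samples here (an assumption for brevity); in practice a stratified split that preserves each class's proportion in all three sets would be preferable:

```python
import random

# Shuffle 100 sample indices, then carve out 60/20/20.
random.seed(0)
indices = list(range(100))
random.shuffle(indices)

train = indices[:60]         # 60% for fitting the model
validation = indices[60:80]  # 20% for model selection / tuning
test = indices[80:]          # 20% held out for the final estimate

print(len(train), len(validation), len(test))  # 60 20 20
```

Shuffling before slicing ensures the three sets are disjoint random subsets rather than contiguous blocks of the original ordering.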

mrdatamx