
I know there is a rule of thumb to split the data into 70%-90% training data and 10%-30% validation data. But if my test set is small, for example only 5% of the size of the training set, and I can't make it bigger, should the validation set be the same size as the test set?

Amit S

1 Answer


There are no hard guidelines. It is common practice to make the validation set and the test set the same size: if you need $N$ samples to assess the quality of the final results on the test set, you probably need a similar number to validate the intermediate results. The general concern is that you want neither too small a training set, nor too small a test or validation set (a concrete split along these lines is sketched in the code below, after the quote):

There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.
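As a minimal sketch of the "equal-sized validation and test sets" practice, here is one way to carve the data using scikit-learn; the dataset, the 80/10/10 proportions, and the random seeds are illustrative assumptions, not something prescribed by the answer above.

    # Minimal sketch (assumes scikit-learn and NumPy; synthetic data for illustration).
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 5)                 # 1000 hypothetical samples, 5 features
    y = rng.randint(0, 2, size=1000)

    # First split: hold out 20% of the data for validation + test combined.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.20, random_state=0)

    # Second split: divide the held-out 20% in half, giving validation and
    # test sets of the same size (10% of the full data each).
    X_val, X_test, y_val, y_test = train_test_split(
        X_hold, y_hold, test_size=0.50, random_state=0)

    print(len(X_train), len(X_val), len(X_test))  # 800 100 100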

If your dataset is small and you need to decide on the sizes of the subsamples, check the "Can I use a tiny Validation set?" thread.

One additional concern is that you will look at the test set only once, while you may check the validation set metrics many times. This can lead to overfitting to the validation set (cherry-picking a result that happens to work well on it). That is an argument against a tiny validation set, since it is easier to overfit to a small set than to a large one. The simulation below illustrates the effect.
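The following rough simulation is my own illustration, not part of the original answer: the labels are pure noise, so every candidate "model" has a true accuracy of 0.5, yet repeatedly selecting on a tiny validation set produces an inflated validation score that does not hold up on a larger test set. The set sizes and number of trials are arbitrary assumptions.

    # Rough illustration of cherry-picking on a tiny validation set (synthetic data).
    import numpy as np

    rng = np.random.RandomState(42)
    n_val, n_test, n_trials = 20, 2000, 200   # tiny validation set, many attempts

    # Labels are random noise, so no classifier can truly beat 50% accuracy.
    y_val = rng.randint(0, 2, size=n_val)
    y_test = rng.randint(0, 2, size=n_test)

    best_val_acc, best_test_acc = 0.0, 0.0
    for _ in range(n_trials):
        # Each trial is a random classifier; keep whichever looks best on validation.
        preds_val = rng.randint(0, 2, size=n_val)
        preds_test = rng.randint(0, 2, size=n_test)
        val_acc = (preds_val == y_val).mean()
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            best_test_acc = (preds_test == y_test).mean()

    print(f"best validation accuracy: {best_val_acc:.2f}")  # typically well above 0.5
    print(f"its test accuracy:        {best_test_acc:.2f}")  # stays near 0.5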

Tim