
I have a dataset of 1000 elements. I am doing random subsampling validation with different ratios for the train/test sets (90/10%, 80/20%, 10/90%); for each ratio I generate 100 train/test samples. My question is how to fairly compare the results obtained with the different ratios, given that the test sets have different sizes for each ratio. Does it even make sense to compare different ratios? My intention is not necessarily to provide a model that does the estimations, but to show that anyone can build a model and would only need a small training set (say 10%) to make an estimation for a bigger set with such and such uncertainty.
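For reference, here is a minimal sketch of the setup I describe above (assuming scikit-learn; the data and the linear model are just placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)    # placeholder data: 1000 elements

results = {}
for train_frac in (0.9, 0.8, 0.1):          # the 90/10, 80/20 and 10/90 splits
    errors = []
    for _ in range(100):                    # 100 random train/test samples per ratio
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_frac, random_state=rng)
        model = LinearRegression().fit(X_tr, y_tr)
        errors.append(mean_squared_error(y_te, model.predict(X_te)))
    results[train_frac] = (np.mean(errors), np.std(errors))

print(results)  # mean and spread of the test error for each ratio
```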

Any hint would be very much appreciated.

Bur Nor
  • 1000 observations is too few for splitting the data; go for cross validation, see https://stats.stackexchange.com/questions/50609/validation-data-splitting-into-training-vs-test-datasets or https://stats.stackexchange.com/questions/509080/cross-validation-with-gridsearchcv-or-train-val-test-split – kjetil b halvorsen Jul 18 '21 at 17:07
  • @kjetilbhalvorsen Thanks. After educating myself a little on the topic, I can say that what I am doing is random subsampling validation with different ratios for the train/test sets. My question is how to compare the results given by the different ratios. Does it even make sense to compare different ratios? My intention is not necessarily to provide a model that does prediction, but to show that anyone can build a model with their data and would only need a small training set (say 10%) to make a prediction with such and such certainty. – Bur Nor Jul 19 '21 at 12:28
  • Please, do not delete a question and then ask it again! Work with this question and make it better so that hopefully someone can answer! – kjetil b halvorsen Jul 19 '21 at 21:48

1 Answer


If you want to prove that your model only needs 10% of the data for training, then there are multiple options.

1. Fixed validation set:

  • Fix a validation set of whatever size.
  • From the rest of the data, choose a random 10% for training and measure performance on the validation set.
  • Repeat this step a few times (the choice of performance metric also matters here, for example whether you can calculate a p-value and FDR).
  • Compare the average performance against the case where you instead choose a random 20% (or another size) of the non-validation data.

2. Same as option 1, except that at every iteration you choose a new random validation set (always of the same size).

Basically, to compare the performance of the model for different training set sizes, you should keep the validation set the same size; a sketch of option 1 is given below.
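A minimal sketch of option 1, assuming scikit-learn (the data, the estimator, and the error metric are placeholders, not part of the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X, y = rng.rand(1000, 5), rng.rand(1000)      # placeholder data

# Fix one validation set and never change it.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

def avg_val_error(train_frac, n_repeats=100):
    """Average validation error when training on a random fraction of the non-validation data."""
    errors = []
    for _ in range(n_repeats):
        idx = rng.choice(len(X_rest), size=int(train_frac * len(X_rest)), replace=False)
        model = LinearRegression().fit(X_rest[idx], y_rest[idx])
        errors.append(mean_squared_error(y_val, model.predict(X_val)))
    return np.mean(errors)

# Compare training on 10% vs 20% of the remaining data, scored on the SAME validation set.
print(avg_val_error(0.1), avg_val_error(0.2))
```

Because every training subset is scored on the identical validation set, the averages for the 10% and 20% settings are directly comparable.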

user3494047
  • Thanks. One question: do you see a problem with comparing validation sets of different sizes (for instance, by comparing the mean of the errors)? If so, what exactly would be the problem? – Bur Nor Jul 20 '21 at 21:51
  • Yes, I do. It is difficult to interpret comparisons of performance on different validation sets, especially validation sets of different sizes. That is why I suggest using the same validation set even when the training set size differs, or averaging performance over validation sets of the same size. – user3494047 Jul 22 '21 at 06:13