
I am using Shao's 1993 article "Linear Model Selection by Cross-Validation" as a starting point for the following cross-validation strategy for a machine learning algorithm:

  1. NOTE: all of the data in this method is actually generated using pseudo-random number generators, based on assumptions about the statistical properties of the physical environment and the process by which the data would be produced in real life
  2. Randomly select training examples from a very large data set, and randomly select validation samples from a partitioned-off part of the same data set (the validation partition is usually quite a few times larger than the training partition)
  3. After training for at least as many epochs as the total number of training and validation examples, store a validation metric (e.g., precision-recall AUC) for the epoch with the lowest validation loss as the measure of algorithm performance
  4. Repeat the entire process above for multiple configurations of the algorithm (e.g., differing numbers of "neurons", different training losses, etc.)
  5. Select the algorithm with the best performance, then train it on all available data until training loss starts to increase (a rough sketch of steps 2-5 follows this list)
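
To make the loop concrete, here is a minimal sketch of steps 2-5 in Python with scikit-learn. The generated data, the candidate hidden-layer widths, and the 1:4 training/validation ratio are placeholder assumptions, and step 3's "epoch with lowest validation loss" bookkeeping is approximated by MLPClassifier's built-in early stopping rather than tracked by hand:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import ShuffleSplit
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)

# Step 1 stand-in: pseudo-randomly generated data (placeholder for the
# simulated physical-environment data described in the question).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=rng)

# Step 2: Monte Carlo splits, with the validation partition several times
# larger than the training partition (the 1:4 ratio here is arbitrary).
splitter = ShuffleSplit(n_splits=10, train_size=0.2, test_size=0.8,
                        random_state=rng)

# Step 4: candidate configurations of the algorithm (here, hidden-layer widths).
configs = {width: [] for width in (8, 32, 128)}

for train_idx, val_idx in splitter.split(X):
    for width, scores in configs.items():
        # Step 3: early_stopping serves as a proxy for "keep the epoch with the
        # lowest validation loss"; PR-AUC on the held-out partition is stored
        # as the performance measure for this split.
        model = MLPClassifier(hidden_layer_sizes=(width,), early_stopping=True,
                              max_iter=500, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[val_idx])[:, 1]
        scores.append(average_precision_score(y[val_idx], probs))

# Step 5: keep the configuration with the best average PR-AUC across splits,
# then refit it on all available data.
best_width = max(configs, key=lambda w: float(np.mean(configs[w])))
final_model = MLPClassifier(hidden_layer_sizes=(best_width,), max_iter=500,
                            random_state=0).fit(X, y)
print({w: round(float(np.mean(s)), 3) for w, s in configs.items()}, best_width)
```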

The 1993 paper demonstrates, convincingly in my view, the superiority of Monte Carlo cross-validation (MCCV) for linear model selection. That said, I am now applying the paper's methods to non-linear algorithms/models, and I would like to know whether anyone is aware of research in this area that would cause me to question the validity of my approach.

If possible, please ignore the implications of which particular pseudo-random number generation process is used for the random selection in the steps above.
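
For a rough sense of the relative sizes involved, here is a small illustration of how the training/validation split scales under one common reading of Shao's consistency condition (validation fraction n_v/n tending to 1 while the construction size n_c still grows), taking n_c on the order of n^(3/4). The 3/4 exponent is an illustrative assumption on my part, not a prescription:

```python
# Illustration only: training ("construction") vs. validation sizes when
# n_c is chosen as roughly n**(3/4), one concrete way to satisfy
# n_v/n -> 1 while n_c -> infinity.
for n in (1_000, 10_000, 100_000, 1_000_000):
    n_c = round(n ** 0.75)   # training / construction set size
    n_v = n - n_c            # validation set size
    print(f"n={n:>9,}  n_c={n_c:>7,}  n_v={n_v:>9,}  n_v/n_c={n_v / n_c:6.1f}")
```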

brethvoice
  • You have a very small dataset, 40 samples. Nothing is really going to give you confidence in the outcomes. – Aksakal Mar 29 '21 at 16:25
  • @user36041 I updated the question to be less specific about training data set size and focus more on the relative sizes for training and validation. Based on Shao's paper it would appear that when selecting a model the validation set could always be larger, and usually much larger, than the training set. – brethvoice Mar 30 '21 at 13:31
  • The trouble is that absolute numbers matter. If your data set is really 40 samples then it’s just a lost cause for any machine learning technique. Nothing can be done. – Aksakal Mar 30 '21 at 13:33
  • @user36041 could we go with "very large" meaning large enough to not be a lost cause for machine learning? I am dealing with practically infinite data sets which are artificially generated at the moment, for example. The thing I am focusing on is Shao's formula where he suggests letting the number of validation data points grow exponentially larger than the number of training data as explained here: https://datascience.stackexchange.com/q/87266/93564 – brethvoice Mar 30 '21 at 18:32
  • I don't think there's a universal optimal ratio of training/test sets. Actually, paying too much attention to the test set is very dangerous, in the sense that at some point you won't notice that your test set has silently become your training set. In fact, the way most people use test sets makes it questionable whether the test set is really different from the training set. – Aksakal Mar 30 '21 at 19:29

0 Answers