I am using Shao's 1993 article "Linear Model Selection by Cross-Validation" as a starting point for the following cross-validation strategy for a machine learning algorithm:
- NOTE: all of the data in this method is actually generated with pseudo-random number generators, based on assumptions about the statistical properties of the physical environment and the process by which the data would be produced in real life
- Randomly select training examples from a very large data set, and randomly select other samples for validation from a partitioned-off part of the same data set (the validation partition is usually quite a few times larger than the training partition)
- After training for at least as many epochs as the combined number of training and validation examples, store a validation metric (e.g., precision-recall AUC) from the epoch with the lowest validation loss as the measure of the algorithm's performance
- Repeat the entire process above for multiple configurations of the algorithm (e.g., differing numbers of "neurons", different training losses, etc.)
- Select the configuration with the best performance, then train it on all available data until training loss starts to increase (a rough sketch of this whole loop follows the list)
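
For concreteness, here is a minimal sketch of the selection loop described above, assuming a binary-classification task. Every concrete choice in it is an assumption rather than part of my actual setup: scikit-learn's `make_classification` stands in for the pseudo-randomly generated data set, `MLPClassifier` (with its built-in `early_stopping`, a rough substitute for keeping the epoch with the lowest validation loss) stands in for the algorithm, and `average_precision_score` serves as the precision-recall-AUC metric.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Stand-in for the large pseudo-randomly generated data set (assumption).
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# Candidate configurations, e.g. differing numbers of "neurons" (assumption).
configs = [{"hidden_layer_sizes": (h,)} for h in (8, 32, 128)]

n_train = 1_000    # small training partition
n_val = 5_000      # validation partition several times larger than training
n_repeats = 20     # Monte Carlo repetitions per configuration

mean_scores = {}
for i, cfg in enumerate(configs):
    repeat_scores = []
    for _ in range(n_repeats):
        # Randomly select disjoint training and validation examples.
        idx = rng.permutation(len(X))
        tr, va = idx[:n_train], idx[n_train:n_train + n_val]
        # early_stopping here only approximates "store the metric at the
        # epoch with lowest validation loss" from the procedure above.
        model = MLPClassifier(max_iter=500, early_stopping=True,
                              random_state=0, **cfg)
        model.fit(X[tr], y[tr])
        # Precision-recall AUC on the held-out validation split.
        p = model.predict_proba(X[va])[:, 1]
        repeat_scores.append(average_precision_score(y[va], p))
    mean_scores[i] = np.mean(repeat_scores)

# Pick the best-scoring configuration and retrain on all available data.
best = max(mean_scores, key=mean_scores.get)
final_model = MLPClassifier(max_iter=500, **configs[best]).fit(X, y)
```

In a real training loop I would checkpoint per epoch and keep the epoch with the lowest validation loss, as described in the third bullet; the `early_stopping` flag is only a convenient approximation for this sketch.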
The 1993 paper, in my view, makes a compelling case for the superiority of Monte Carlo cross-validation (MCCV) for linear model selection. That said, I am now applying the methods of the paper to non-linear algorithms/models, and I would like to know if anyone is aware of research in this area that would cause me to question the validity of my approach.
If possible, please ignore the implications of which particular pseudo-random number generation process drives the random selection in the first step above.