
I am doing a method comparison of some machine learning models across certain scenarios. I simulated data where associations are known. To me, this seems like a simple way to have as much data as I want to train, tune, and test models (over and above the obvious benefit of knowing the exact structure of the data).

However, the idea of k-fold cross-validation during tuning is ingrained, and I wanted to ask others for input.

Can I train and tune these ML models by just using a different seed in simulation than the test set, or do I need to use k-fold cross validation?

Thanks

1 Answer


This is a good question, and I believe you can tune, train, and test using different seeds, provided the datasets generated each time are sufficiently different (vaguely defined, I know). Since you have a generator for the data, you can in theory draw an unlimited number of samples for each set, assuming the feature space admits enough distinct combinations. Cross-validation is essentially a way of simulating different datasets generated by different seeds; it is typically used because the dataset at hand is finite.
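A minimal sketch of this seed-based workflow, in case it helps. The generator, the linear ground truth, and the ridge model here are all illustrative assumptions, not anything from the question; the point is only that distinct seeds yield independent train, validation, and test draws, which play the role that CV folds play for a finite dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def make_data(seed, n=2000, p=5, noise_sd=1.0):
    """Simulate data with a known linear association y = X @ beta + noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = np.arange(1, p + 1)  # known ground-truth coefficients
    y = X @ beta + rng.normal(scale=noise_sd, size=n)
    return X, y

# Different seeds give independent draws for each role.
X_tr, y_tr = make_data(seed=1)
X_val, y_val = make_data(seed=2)
X_te, y_te = make_data(seed=3)

# Tune a hyperparameter on the validation draw instead of via k-fold CV.
best_alpha, best_mse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Final fit and evaluation on the held-out draw.
final = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, final.predict(X_te))
```

Because each set comes from an independent draw of the same generating process, the validation estimate is unbiased for tuning in the same way a CV estimate would be, without any fold bookkeeping.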

gunes
    Thanks for the reply! This is what I was thinking, but I wanted to have someone else's input. If the data were too similar, the model would just be unrealistically accurate. If that's the case, then I can increase the standard deviation of the distributions I'm using to simulate. – ChristopherLoan Oct 22 '21 at 16:19