
I am doing a method comparison of some machine learning models across certain scenarios. I simulated data where associations are known. To me, this seems like a simple way to have as much data as I want to train, tune, and test models (over and above the obvious benefit of knowing the exact structure of the data).

However, the idea of k-fold cross-validation during tuning is ingrained, and I wanted to ask others for input.

Can I train and tune these ML models by just using a different seed in simulation than the test set, or do I need to use k-fold cross validation?

Thanks

1 Answer


This is a good question, and I believe you can tune, train, and test using different seeds, provided the datasets generated each time are sufficiently different (vaguely defined, I know). Since you have a generator for the data, you can in theory draw an unlimited number of samples for each set, assuming the feature space admits enough distinct combinations. Cross-validation is essentially a way of simulating different datasets generated by different seeds; it is typically used because the dataset at hand is finite.
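A minimal sketch of this seed-based workflow, in case it helps. The generator, the linear ground truth, and the ridge model here are all illustrative assumptions, not anything from the question; the point is only that distinct seeds yield independent train, validation, and test draws, which play the role that CV folds play for a finite dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def make_data(seed, n=2000, p=5, noise_sd=1.0):
    """Simulate data with a known linear association y = X @ beta + noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    beta = np.arange(1, p + 1)  # known ground-truth coefficients
    y = X @ beta + rng.normal(scale=noise_sd, size=n)
    return X, y

# Different seeds give independent draws for each role.
X_tr, y_tr = make_data(seed=1)
X_val, y_val = make_data(seed=2)
X_te, y_te = make_data(seed=3)

# Tune a hyperparameter on the validation draw instead of via k-fold CV.
best_alpha, best_mse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse

# Final fit and evaluation on the held-out draw.
final = Ridge(alpha=best_alpha).fit(X_tr, y_tr)
test_mse = mean_squared_error(y_te, final.predict(X_te))
```

Because each set comes from an independent draw of the same generating process, the validation estimate is unbiased for tuning in the same way a CV estimate would be, without any fold bookkeeping.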

gunes
    Thanks for the reply! This is what I was thinking, but I wanted to have someone else's input. If the data were too similar, the model would just be unrealistically accurate. If that's the case, then I can increase the standard deviation of the distributions I'm using to simulate. – ChristopherLoan Oct 22 '21 at 16:19