
More specifically, when bootstrapping or cross-validating, we often pick the argument to `set.seed()` arbitrarily, either a number we happen to like or, more often, something like 12 or 123. This choice influences the outcome of the model, and the results may change when the seed changes if the model is not robust enough.

So why do we want this random splitting of the data, rather than the same split every time?

And why do we want to set a seed?

What good does it do prior to splitting the data into training and test sets?
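
To make the question concrete, here is a minimal sketch of what I mean, using hypothetical data and base R only. The seed pins down the pseudo-random sequence, so the same seed reproduces the same train/test split:

```r
# Hypothetical data: 1,000 rows, purely for illustration
n <- 1000
dat <- data.frame(x = rnorm(n), y = rnorm(n))

# Setting a seed makes the "random" split reproducible
set.seed(123)
train_idx <- sample(n, size = 0.8 * n)   # 80% train, 20% test
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Re-running with the same seed yields the identical split...
set.seed(123)
identical(train_idx, sample(n, size = 0.8 * n))  # TRUE

# ...while a different seed yields a different split
set.seed(42)
identical(train_idx, sample(n, size = 0.8 * n))  # FALSE (almost surely)
```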

Thomas
  • Closely related: https://stats.stackexchange.com/questions/80407, https://stats.stackexchange.com/questions/120371, https://stats.stackexchange.com/questions/58890, https://stats.stackexchange.com/questions/235232, https://stats.stackexchange.com/questions/335936, https://stats.stackexchange.com/questions/121225, *etc.* – whuber Sep 15 '20 at 16:49
  • Yeah, that is indeed closely related, thanks. I get that you should always stick with one `set.seed()` then, but a follow-up question: if I switch between `set.seed()` values on a large data set (approx. 1 million data points) that I split into train and test, it should not change the outcome much, right? – Thomas Sep 15 '20 at 17:31
  • I find it's wise, whenever possible, to replicate an analysis using a different seed. If the results differ in any material way, you potentially have a problem and need to investigate it. – whuber Sep 15 '20 at 18:38
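
A minimal sketch of the kind of check whuber describes in the comment above, using hypothetical data and a plain linear model (the data, the seeds, and the model are all illustrative assumptions): refit the analysis under several seeds and compare the held-out error.

```r
# Hypothetical data, purely for illustration
n <- 1000
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

# Repeat the train/test analysis under several seeds
rmse_by_seed <- sapply(c(1, 12, 123, 2020, 31337), function(seed) {
  set.seed(seed)
  train_idx <- sample(n, size = 0.8 * n)
  fit  <- lm(y ~ x, data = dat[train_idx, ])
  pred <- predict(fit, newdata = dat[-train_idx, ])
  sqrt(mean((dat$y[-train_idx] - pred)^2))  # test RMSE
})

# If the analysis is robust, these values should be close to one another;
# large swings across seeds signal a seed-sensitive result worth investigating
rmse_by_seed
```

On a data set of around 1 million rows, as in the follow-up comment, these values would typically agree very closely; noticeable differences across seeds would point to instability that deserves investigation.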

0 Answers