4

I am trying to do ridge and lasso regression for out-of-sample prediction. The optimal lambda is chosen via cross-validation. I run my analysis for different seeds in R, and depending on the seed I get a different result.

Now I ask myself: does the efficiency of these shrinkage methods really depend on the seed? I mean, do I have to find the right seed to produce a model that gives me the smallest MSE?
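
A minimal sketch of the kind of setup I mean, using `cv.glmnet` from the glmnet package (the data here are simulated purely for illustration):

```r
library(glmnet)

# Simulated data, purely for illustration
set.seed(0)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

# The same model-fitting code, run under different seeds
for (seed in c(1, 2, 3)) {
  set.seed(seed)                     # the seed controls the CV fold assignment
  cv <- cv.glmnet(x, y, alpha = 1)   # alpha = 1: lasso; alpha = 0: ridge
  cat("seed", seed,
      "lambda.min =", signif(cv$lambda.min, 3),
      "CV MSE =", signif(min(cv$cvm), 3), "\n")
}
```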

Dima Ku
  • A closely related question is addressed at https://stats.stackexchange.com/questions/80407: you might enjoy reading that thread, too. – whuber Jul 09 '18 at 20:13
  • Related to @whuber's comment: [If so many people use set.seed(123) doesn't that affect randomness of world's reporting?](https://stats.stackexchange.com/q/205961/1352) – Stephan Kolassa Jul 10 '18 at 06:15

1 Answer

4

Each time you run cross-validation you get an estimate of the true population test error. Different random seeds give you different estimates of this population quantity, but each is an estimate of the same underlying truth.
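
To see this concretely, here is a minimal sketch using the glmnet package on simulated data: repeating cross-validation under many seeds gives a spread of test-error estimates, all of the same underlying quantity.

```r
library(glmnet)

# Simulated data, purely for illustration
set.seed(0)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

# One test-error estimate per seed; the seed only changes the fold assignment
cv_mse <- sapply(1:50, function(seed) {
  set.seed(seed)
  min(cv.glmnet(x, y, alpha = 1)$cvm)   # CV MSE at lambda.min (lasso)
})
summary(cv_mse)   # the spread is fold-assignment noise around one true value
```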

If you re-seed the random number generator in a quest to drive down the estimated test error, you are going to bias your estimate of the test error downwards (this is sometimes called seed-hacking, and it is bad practice). If you take your final test-error estimate to be the minimum you have seen over many runs of the random number generator, then you are definitely going to end up with a very optimistic estimate of the test error, and you should not be surprised when your model performs much worse in production.
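
Continuing with the `cv_mse` vector from the sketch above, the difference between an honest summary and a seed-hacked one is easy to see:

```r
mean(cv_mse)   # a fair summary over fold assignments
min(cv_mse)    # what seed-hacking would report: biased downwards
```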

Instead, pick a seed and stick with it. My personal seed is 154.

Matthew Drury