4

I am trying to do ridge and lasso regression for out-of-sample prediction. The optimal lambda is chosen via cross-validation. I run my analysis for different seeds in R, and depending on the seed I get a different result.

Now I ask myself: does the efficiency of these shrinkage methods really depend on the seed? I mean, do I have to find the right seed to produce a model that gives me the smallest MSE?
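
A minimal sketch of the kind of setup I mean, using `cv.glmnet` from the glmnet package (the data here are simulated purely for illustration):

```r
library(glmnet)

# Simulated data, purely for illustration
set.seed(0)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

# The same model-fitting code, run under different seeds
for (seed in c(1, 2, 3)) {
  set.seed(seed)                     # the seed controls the CV fold assignment
  cv <- cv.glmnet(x, y, alpha = 1)   # alpha = 1: lasso; alpha = 0: ridge
  cat("seed", seed,
      "lambda.min =", signif(cv$lambda.min, 3),
      "CV MSE =", signif(min(cv$cvm), 3), "\n")
}
```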

Dima Ku
  • A closely related question is addressed at https://stats.stackexchange.com/questions/80407: you might enjoy reading that thread, too. – whuber Jul 09 '18 at 20:13
  • Related to @whuber's comment: [If so many people use set.seed(123) doesn't that affect randomness of world's reporting?](https://stats.stackexchange.com/q/205961/1352) – Stephan Kolassa Jul 10 '18 at 06:15

1 Answer

4

Each time you run cross-validation you get an estimate of the true population test error. Different random seeds give you different estimates of this population quantity, but each is an estimate of the same underlying truth.
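
To see this concretely, here is a minimal sketch using the glmnet package on simulated data: repeating cross-validation under many seeds gives a spread of test-error estimates, all of the same underlying quantity.

```r
library(glmnet)

# Simulated data, purely for illustration
set.seed(0)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:5] %*% rep(1, 5)) + rnorm(n)

# One test-error estimate per seed; the seed only changes the fold assignment
cv_mse <- sapply(1:50, function(seed) {
  set.seed(seed)
  min(cv.glmnet(x, y, alpha = 1)$cvm)   # CV MSE at lambda.min (lasso)
})
summary(cv_mse)   # the spread is fold-assignment noise around one true value
```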

If you re-seed the random number generator in a quest to drive down the estimated test error, you are going to bias your estimate of the test error downwards (this is sometimes called seed-hacking, and it is bad practice). If you take your final test-error estimate to be the minimum you have seen over many runs of the random number generator, then you are definitely going to end up with a very optimistic estimate of the test error, and you should not be surprised when your model performs much worse in production.
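
Continuing with the `cv_mse` vector from the sketch above, the difference between an honest summary and a seed-hacked one is easy to see:

```r
mean(cv_mse)   # a fair summary over fold assignments
min(cv_mse)    # what seed-hacking would report: biased downwards
```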

Instead, pick a seed and stick with it. My personal seed is 154.

Matthew Drury