34

In almost all of the analysis work that I've ever done I use:

set.seed(42) 

It's an homage to Hitchhiker's Guide to the Galaxy. But I'm wondering if I'm creating bias by using the same seed over and over.

amoeba
Brandon Bertelsen
    How do you use seed? Depending on your typical use case the answer ranges from yes to no. – Momo Dec 23 '13 at 16:34
  • Brandon, what if people reply to you YES? What will you do? I feel apprehensive. – ttnphns Dec 23 '13 at 19:39
  • @Momo Let's just say that I always set it, for fear of forgetting it and being unable to reproduce my results. This is across independent and different types of experiments. I'd appreciate understanding both the yes and no cases. – Brandon Bertelsen Dec 23 '13 at 20:57
  • @ttnphns Treat it like a lesson learned? – Brandon Bertelsen Dec 23 '13 at 21:03
  • It is OK for the purpose of reproducing results, whether they are biased or not. But unless your sample size (number of independent experiments or observations) produced under that seed approaches infinity, some bias will persist. Note two more important things: we usually use _pseudo_ random generators, which complicates the consequences for you. The consequences also depend on what type of random generator you use (e.g. Mersenne twister or what?). Thus, for serious trials of something random it's always better to set the seed to random. – ttnphns Dec 24 '13 at 00:17
  • Depends what you are doing with it. For example, a delayed version of a particular random process summed with the original generates a different random process, unlike two truly random processes, which retain the distribution when delayed and summed. – Dole Apr 08 '16 at 22:17
  • Related: [If so many people use set.seed(123) doesn't that affect randomness of world's reporting?](https://stats.stackexchange.com/q/205961/1352) – Stephan Kolassa Jul 10 '18 at 06:15

2 Answers

37

There is no bias if the RNG is any good. By always using the same seed you are, however, creating a strong interdependence among all the simulations you perform in your career. This creates an unusual kind of risk.

  • By using the same seed each time, either you are always getting a pretty nice pseudorandom sequence and all your work goes well or--with very low but non-zero probability--you are always using a pretty bad sequence and your simulations are not as representative of the underlying distributions as you think they might be. Either all your work is pretty good or all of it is pretty lousy!

  • Contrast this with using truly random starting seeds each time. Once in a very long while you might obtain a sequence of random values that is not representative of the distribution you are modeling, but most of the time you would be just fine. If you never attempted to reproduce your own work (with a new seed), then once or twice in your career you might get misleading results, but the vast majority of the time you will be ok.

There is a simple and obvious cure: Always, always check your work by restarting with another seed. It's virtually impossible that two seeds accidentally will give misleading results in the same way.
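The check takes only a few lines of R. In this sketch, `simulate` is a hypothetical stand-in for whatever analysis you actually run; the point is that the conclusion should survive a change of seed:

```r
# Hypothetical stand-in for a real analysis: estimate a mean by simulation.
simulate <- function(seed, n = 1e4) {
  set.seed(seed)
  mean(rnorm(n))
}

est1 <- simulate(17)        # the usual personal seed
est2 <- simulate(20090817)  # restart with any other seed
# With a healthy RNG the two runs agree to within Monte Carlo error
# (each estimate has standard error 1/sqrt(n) = 0.01 here):
abs(est1 - est2) < 0.1
```

If the two runs tell different stories, suspect the simulation (or the RNG stream), not your luck.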

On the other hand, there is extraordinary merit in having a well-known "personal seed": it shows the world you are being honest. A sly, subtle way to lie with simulations is to repeat them until they give you a predetermined outcome. Here's a working R example to "demonstrate" that even a fair coin is highly likely to land heads more than half the time:

n.flips <- 100
seeds <- 1:10^3
#
# Run some preliminary simulations.
#
results <- sapply(seeds, function(seed) {
  set.seed(seed)
  mean(runif(n.flips) > 1/2)
})
#
# Now do the "real" simulation.
#
seed <- seeds[which.max(results)]
set.seed(seed)
x <- mean(runif(n.flips) > 1/2)
z <- (x - 1/2) * 2 * sqrt(n.flips)
cat("Mean:", x, "Z:", z, "p-value:", pnorm(z, lower.tail=FALSE), "\n")

By looking at a wider range of seeds (from $1$ through $10^6$), I was able to find a congenial one: 218134. When you start with this as the seed, the resulting $100$ simulated coin flips exhibit $75$ heads! That is significantly different from the expected value of $50$ ($p=0.000004$).

The implications can be fascinating and important. For instance, if I knew in advance whom I would be recruiting into a randomized double-blind controlled trial, and in what order (which I might be able to control as a university professor testing a group of captive undergraduates or lab rats), then beforehand I could run such a set of simulations to find a seed that groups the students more to my liking to favor whatever I was hoping to "prove." I could include the planned order and that seed in my experimental plan before conducting the experiment, thereby creating a procedure that no critical reviewer could ever impeach--but nevertheless stacking the deck in my favor. (I believe there are entire branches of pseudoscience that use some variant of this trick to gain credibility. Would you believe I actually used ESP to control the computer? I can do it at a distance with yours, too!)

Somebody whose default seed is known cannot play this game.

My personal seed is 17, as a large proportion of my posts attest (currently 155 out of 161 posts that set a seed use this one). In R it is a difficult seed to work with, because (as it turns out) most small datasets I create with it have a strong outlier. That's not a bad characteristic ... .

whuber
    Did you pick $17$ *because* of that property, or is that just a nice coincidence? – Matthew Drury Jun 18 '15 at 22:17
    @Matthew It goes back to a group of high school students with a shared interest in math who were studying number theory one summer long ago. One, as I recall, jokingly proposed 17 as the archetypical integer and our group quickly found many rationalizations for this, some of mathematical interest and some merely humorous (at least from the point of view of a math nerd). There are interesting mathematical and historical properties of this number that single it out for attention, such as Gauss's discovery of the constructibility of the 17-gon. `R`'s behavior is purely accidental. – whuber Jun 18 '15 at 22:27
    @Matthew BTW, my seed is related to Brandon's: there are precisely 42 ordered pairs of distinct integral primes of size 17 or less :-). – whuber Jun 18 '15 at 22:36
    I used to be able to construct a 17-gon with ruler and compass as a party trick. Well, for the right definition of *party* I guess... – Matthew Drury Jun 18 '15 at 22:40
    @MatthewDrury they poppin' bottles at your party. – Brandon Bertelsen Jun 18 '15 at 22:42
  • @whuber Excellent answer, as always. Re: adjusting your seed for "better" results - how bad is this really? For example, at your prompting, I just played with an old segmentation (straightforward k-means) I worked on. Changing the seed made minor changes in the profile, size, and order of the groups, but the overall story that would follow is the same; in a few cases it would have made some of my previous findings somewhat more powerful (we're talking a few %pts, but regardless). – Brandon Bertelsen Jun 18 '15 at 22:55
    For large simulations one would hope the seed makes no difference. But it all depends on details of the simulation. A few years ago I made a mathematical error in a derivation that caused a result to be wrong by just a tiny bit in the case I was interested in. I detected this only by noticing that Z-values in a very large simulation were just a little large on average. It was crucial to know that this tiny little signal was not an artifact of the pseudo RNG I was using. – whuber Jun 19 '15 at 13:34
  • https://www.youtube.com/watch?v=87uo2TPrsl8 @MatthewDrury – Brandon Bertelsen Mar 08 '19 at 20:05
2

As stated above, a good RNG will not generate bias from reusing the same seed. However, there will be a correlation among the results, because the same pseudo-random stream starts each computation. Whether that correlation matters is not a question of mathematics; it depends on what you do with the results.
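A minimal R illustration of that correlation in its most extreme form: reusing a seed reproduces the stream exactly, while a fresh seed starts a new one.

```r
set.seed(42); a <- runif(5)  # first "experiment"
set.seed(42); b <- runif(5)  # second "experiment", same seed
set.seed(43); d <- runif(5)  # second "experiment", fresh seed

identical(a, b)  # TRUE: same seed gives perfectly correlated (identical) draws
identical(a, d)  # FALSE: a new seed starts an independent-looking stream
```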

Using the same seed is OK at times: for debugging or when you know you want correlated results.

ttw