In almost all of the analysis work that I've ever done I use:
set.seed(42)
It's an homage to Hitchhiker's Guide to the Galaxy. But I'm wondering if I'm creating bias by using the same seed over and over.
There is no bias if the RNG is any good. By always using the same seed you are, however, creating a strong interdependence among all the simulations you perform in your career. This creates an unusual kind of risk.
By using the same seed each time, either you are always getting a pretty nice pseudorandom sequence and all your work goes well or--with very low but non-zero probability--you are always using a pretty bad sequence and your simulations are not as representative of the underlying distributions as you think they might be. Either all your work is pretty good or all of it is pretty lousy!
Contrast this with using truly random starting seeds each time. Once in a very long while you might obtain a sequence of random values that is not representative of the distribution you are modeling, but most of the time you would be just fine. If you never attempted to reproduce your own work (with a new seed), then once or twice in your career you might get misleading results, but the vast majority of the time you will be ok.
There is a simple and obvious cure: Always, always check your work by restarting with another seed. It's virtually impossible that two seeds accidentally will give misleading results in the same way.
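A minimal sketch of that check, using a stand-in simulation (the function and seeds here are illustrative, not part of any real analysis): run the same computation under two different seeds and confirm the estimates agree to within Monte Carlo error.

```r
# Hypothetical simulation: estimate a mean from n pseudorandom draws.
run.simulation <- function(seed, n = 1e4) {
  set.seed(seed)
  mean(rnorm(n))  # stand-in for whatever quantity you are estimating
}

estimate.1 <- run.simulation(17)     # your usual personal seed
estimate.2 <- run.simulation(12345)  # a fresh check seed

# The two estimates should differ by no more than a few standard
# errors (here the SE of a mean of n standard normals is 1/sqrt(n)).
abs(estimate.1 - estimate.2) < 4 / sqrt(1e4)
```

If the two runs disagree by much more than that, the discrepancy itself is the warning sign the original seed alone could never give you.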
On the other hand, there is extraordinary merit in having a well-known "personal seed": it shows the world you are being honest. A sly, subtle way to lie with simulations is to repeat them until they give you a predetermined outcome. Here's a working R
example to "demonstrate" that even a fair coin is highly likely to land heads more than half the time:
n.flips <- 100
seeds <- 1:10^3
#
# Run some preliminary simulations.
#
results <- sapply(seeds, function(seed) {
  set.seed(seed)
  mean(runif(n.flips) > 1/2)
})
#
# Now do the "real" simulation with the most favorable seed.
#
seed <- seeds[which.max(results)]
set.seed(seed)
x <- mean(runif(n.flips) > 1/2)
z <- (x - 1/2) * 2 * sqrt(n.flips)
cat("Mean:", x, "Z:", z, "p-value:", pnorm(z, lower.tail=FALSE), "\n")
By looking at a wider range of seeds (from $1$ through $10^6$), I was able to find a congenial one: 218134. When you start with this as the seed, the resulting $100$ simulated coin flips exhibit $75$ heads! That is significantly different from the expected value of $50$ ($p=0.000004$).
The implications can be fascinating and important. For instance, if I knew in advance whom I would be recruiting into a randomized double-blind controlled trial, and in what order (which I might be able to control as a university professor testing a group of captive undergraduates or lab rats), then beforehand I could run such a set of simulations to find a seed that groups the students more to my liking to favor whatever I was hoping to "prove." I could include the planned order and that seed in my experimental plan before conducting the experiment, thereby creating a procedure that no critical reviewer could ever impeach--but nevertheless stacking the deck in my favor. (I believe there are entire branches of pseudoscience that use some variant of this trick to gain credibility. Would you believe I actually used ESP to control the computer? I can do it at a distance with yours, too!)
Somebody whose default seed is known cannot play this game.
My personal seed is 17, as a large proportion of my posts attest (currently 155 out of 161 posts that set a seed use this one). In R
it is a difficult seed to work with, because (as it turns out) most small datasets I create with it have a strong outlier. That's not a bad characteristic ... .
As stated above, a good RNG will not generate bias from using the same seed. However, there will be a correlation among the results, because the same pseudorandom sequence starts each computation. Whether this matters isn't a question of mathematics.
Using the same seed is fine at times: for debugging, or when you deliberately want correlated results.
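The correlation is easy to see directly: re-seeding with the same value restarts the generator at the same point, so every downstream draw is reproduced exactly. A small sketch (the seed 17 here is just an example):

```r
# Re-seeding with the same value restarts the stream at the same point.
set.seed(17)
first.run <- runif(5)

set.seed(17)
second.run <- runif(5)

# The two runs are not merely correlated -- they are identical.
identical(first.run, second.run)
```

This exact reproducibility is what makes a fixed seed valuable for debugging, and also why results across analyses that share a seed are not independent.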