
After executing the code

RNGkind(kind="Mersenne-Twister")  # the default anyway
set.seed(123)
n = 10^5
x = runif(n)
print(x[22662] == x[97974])

TRUE is output!

If I use, e.g., RNGkind(kind="Knuth-TAOCP-2002"), something similar happens: I get "only" 99,995 different values in x. Given the periods of both random generators, the results seem highly unlikely.

Am I doing something wrong? I need to generate at least one million random numbers.

I am using Windows 8.1 with R version 3.6.2; Platform: x86_64-w64-mingw32/x64 (64-bit) and RStudio 1.2.5033.


Additional findings:

  1. Having a bag with $n$ different balls, we choose a ball $m$ times and put it back every time. The probability $p_{n, m}$ that all chosen balls are different is equal to ${n\choose m} \, m! \, / \, n^m$.
  2. R documentation points to a link where the implementation of Mersenne-Twister for 64-bit machines is available: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html

The uniform sampling from the $[0, 1]$ interval is obtained by first choosing a random 64-bit integer, so I computed the above probabilities for the 64-bit and (when $p_{2^{64}, 10^5}$ turned out to be rather low) 32-bit case: $$ p_{2^{64}, 10^5}\doteq 0.9999999997289... \qquad p_{2^{32}, 10^5} \doteq 0.3121... $$
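These probabilities can be double-checked numerically; here is a minimal sketch in Python (the helper name `p_all_distinct` is illustrative), summing logs to avoid underflow:

```python
import math

def p_all_distinct(n, m):
    """Probability that m draws with replacement from n values are all distinct."""
    # log of prod_{i=0}^{m-1} (1 - i/n), summed in log space for stability
    return math.exp(sum(math.log1p(-i / n) for i in range(m)))

print(p_all_distinct(2**64, 10**5))  # ≈ 0.9999999997
print(p_all_distinct(2**32, 10**5))  # ≈ 0.312
```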

Then, I tried 1000 random seeds and computed the proportion of cases in which all generated numbers are different: 0.303.

So, currently, I assume that for some reason, 32-bit integers are actually used.
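That hypothesis can be tested outside R as well; the following is a minimal Python sketch (trial count and seed are arbitrary choices), drawing 32-bit integers from Python's own Mersenne-Twister and counting how often all $10^5$ draws come out distinct:

```python
import random

random.seed(0)  # fixed seed for reproducibility
trials, m = 200, 10**5
no_dup = 0
for _ in range(trials):
    draws = [random.getrandbits(32) for _ in range(m)]
    if len(set(draws)) == m:  # all m draws distinct in this trial
        no_dup += 1
# proportion of trials with no duplicates; analytically ≈ 0.312 for 32-bit draws
print(no_dup / trials)
```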

Antoine
  • This question surprised me a lot and I cannot find the reason why! – Haitao Du May 10 '20 at 11:02
    Asking for uniqueness of the values would imply non-independence, thus non-randomness. – Michael M May 10 '20 at 10:37
    @Michael Sure. However, the probability of this happening is rather low, if I understand everything correctly. http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt64.html gives the implementation of the default method and it seems that 64-bit integers are used. Then, I went to compute the probability of choosing $m$ different examples out of $n$ options ($n = 2^{64}$, $m = 10^5$): ${n \choose m} \cdot m! / n^m$. I used Wolfram Alpha and the result is quite high: 0.9999999997289... – Antoine May 10 '20 at 11:06
    Very interesting findings, thanks for digging into it. Btw I was the first with a +1 for this question. – Michael M May 10 '20 at 11:44
    Somebody showed why this is happening: R uses 32-bit numbers. https://arxiv.org/abs/2003.08009. Also https://twitter.com/johnmyleswhite/status/1252387405954289665?s=20 – David May 11 '20 at 00:55
  • Could you include what you intend to use the pseudorandom numbers for? This need not be a problem depending on the application. – Frans Rodenburg May 11 '20 at 04:22
  • @FransRodenburg I wanted to test some other code and the case where the numbers repeat is trickier, so it seemed reasonable (though not necessary in theory) to start with the case where the numbers do not repeat. – Antoine May 11 '20 at 07:51

4 Answers


The documentation of R on random number generation has a few sentences at its end that confirm your suspicion that 32-bit integers are used and explain what you are observing:

Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)

So the implementation in R seems to differ from what is explained on the website of the authors of the Mersenne Twister. Combining this with the birthday paradox, you would expect a duplicate with probability about 0.5 after only about $2^{16}$ numbers, and $10^5 > 2^{16}$. Trying the Wichmann-Hill algorithm as suggested in the documentation:

RNGkind(kind="Wichmann-Hill") 
set.seed(123)
n = 10^8
x = runif(n)
length(unique(x))
# 1e8
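
The birthday-paradox threshold mentioned above can be checked with a quick back-of-the-envelope computation; a Python sketch using the standard approximation $m \approx \sqrt{2\ln 2 \cdot n}$ for a 50% collision probability:

```python
import math

n = 2**32  # number of distinct values a 32-bit generator can return
# Approximate number of draws m at which P(at least one duplicate) reaches 0.5
m_half = math.sqrt(2 * math.log(2) * n)
print(m_half)          # ≈ 77163, a bit over 2^16 = 65536
print(m_half > 2**16)  # True
```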

Note that the original Wichmann-Hill random number generator has the property that its next number can be predicted from its previous one, and therefore does not meet the non-predictability requirements of a valid PRNG. See this document by Dutang and Wuertz, 2009 (section 3).

L_W
    Minor nit: I might replace, in your last paragraph, "...output to be truly random numbers" with "...output to meet non-predictability requirements of a valid PRNG..." – Carl Witthoft May 11 '20 at 11:04

Just to emphasise the arithmetic of the $2^{32}$ point in terms of the number of potential distinct values: if you sample $10^5$ times from $2^{32}$ values with replacement, you would expect an average of $2^{32}\left(1-\left(1-\frac{1}{2^{32}}\right)^{10^5}\right) \approx 10^5 - 1.1641$ distinct values, noting that $\frac{(10^5)^2}{2 \times 2^{32}} \approx 1.1642$ is close to this deficit.
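This expectation can be evaluated robustly; a Python sketch (the helper name is illustrative; `expm1`/`log1p` avoid the cancellation in computing $(1-2^{-32})^{10^5}$ naively):

```python
import math

def expected_distinct(n, m):
    """Expected number of distinct values among m draws with replacement from n."""
    # n * (1 - (1 - 1/n)**m), written with log1p/expm1 for numerical robustness
    return -n * math.expm1(m * math.log1p(-1.0 / n))

n, m = 2**32, 10**5
print(m - expected_distinct(n, m))  # deficit ≈ 1.1641
print(m**2 / (2 * n))               # quadratic approximation ≈ 1.1642
```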

So you would expect many other examples. There are two duplicated pairs with set.seed(1):

n = 10^5
set.seed(1)
x = runif(n)
x[21101] == x[56190]
x[33322] == x[50637]

If you do something similar for the first $2000$ seeds in R with the default runif, you get an average of $10^5 - 1.169$ unique values, which is close to the calculated expectation. Only $30.8\%$ of these seeds produce no duplicates in a sample of $10^5$.

Sample $10^6$ times and you would expect the issue to be about a hundred times worse; indeed, the average number of unique values over the first $2000$ seeds is $10^6 - 116.602$, and none of these seeds failed to produce duplicates.

There is another way of reducing the likelihood of overlaps while still having a uniform distribution: try pnorm(rnorm(n))

set.seed(123)
n = 10^8
x = runif(n)
length(unique(x))
# 98845390
y = pnorm(rnorm(n))
length(unique(y))
# 100000000
Henry
  • Is `rnorm` using 64-bit variables? While that is an ingenious solution, it is a bit of a strange workaround. There should be a simple `runif` call doing the same. Maybe we should make `runif` able to do the same. (In the other answer it is mentioned that 'most algorithms do this' so maybe it should be stated more explicitly *which* algorithms do this). – Sextus Empiricus May 11 '20 at 08:40
  • @SextusEmpiricus something slightly peculiar is going on with that alternative: see `n=100; set.seed(123); x=runif(2*n); x=x[2*(1:n)-1]; set.seed(123); y=pnorm(rnorm(n)); summary(x-y)` and the differences are tiny but typically bigger than $2^{-32}$ – Henry May 11 '20 at 15:45
  • @Henry you sure you aren't just running into floating-point precision errors? In addition, the `rnorm` function is not as simple as your "normifier" there. – Carl Witthoft May 11 '20 at 19:04
  • @CarlWitthoft `rnorm` is not exactly `qnorm(runif)` but it is very close using the default methods: try `set.seed(2020); a=rnorm(5); set.seed(2020); b=qnorm(runif(10))[c(1,3,5,7,9)]; cbind(a,b)` to see what I mean – Henry May 11 '20 at 20:51

There are two problems here. The first has been well-covered in the other answers, to wit: why do duplicates show up for certain configurations of the input arguments.

The other is very important: there is a big difference between "random with replacement" and "a random permutation of a known set." Mathematically, it's completely valid for a random integer sequence to contain, e.g., 6, 6, 6, 6, 6. Most PRNGs fail to do a complete "replacement" in their algorithm, so what we end up with is much closer to (but not exactly, as the example in the posted question shows) a random permutation of the set of values. In fact, since most PRNGs generate the next value based on the current (and possibly a few previous) values, they are almost Markov processes. We call them "random" because we agree that an outside observer cannot determine the generator algorithm, so the next number to show up is unpredictable to that observer.

Consider, then, the difference between `runif` and `sample`, where the latter has an argument explicitly directing whether to select with or without replacement.
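The same distinction can be illustrated outside R; a minimal Python sketch (`random.choices` draws with replacement, `random.sample` without):

```python
import random

random.seed(42)
values = range(10)

with_repl = random.choices(values, k=8)  # with replacement: repeats may occur
without_repl = random.sample(values, 8)  # without replacement: never repeats

print(with_repl)
print(len(set(without_repl)))  # always 8: a partial random permutation
```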

Carl Witthoft

Even though it is counterintuitive, there are good reasons that explain this phenomenon, essentially because a computer uses finite precision. A preprint was posted on arXiv in March 2020 (already mentioned in the comments, by the way) that treats this question thoroughly. It was written by an experienced researcher in computational statistics (not me, nor a friend of mine) and uses R. All the code is reproducible, and you can easily check the claims yourself. To quote the first lines of the conclusion, which seem to answer your question:

Rather unintuitively (but, as it turns out, not unexpectedly), generating random numbers can lead to ties. For generating $n$ random numbers on a $k$-bit architecture, we showed that the expected number of ties is $n-2^{k}(1-(1-2^{-k})^{n})$. Furthermore, we derived a numerically robust formula to compute this number. For a 32-bit architecture as is still used in random number generators (be it for historical reasons, reproducibility or due to run time), the expected number of ties when generating one million random numbers is 116.

The cited version is the one posted on 18th March 2020.

https://arxiv.org/abs/2003.08009
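
The paper's formula is straightforward to evaluate; a Python sketch (the helper name is illustrative; `expm1`/`log1p` give a numerically robust form of the expression):

```python
import math

def expected_ties(n, k):
    """Expected number of ties among n draws on a k-bit architecture:
    n - 2^k * (1 - (1 - 2^-k)^n), evaluated with expm1/log1p for robustness."""
    two_k = 2.0**k
    return n + two_k * math.expm1(n * math.log1p(-1.0 / two_k))

print(round(expected_ties(10**6, 32)))  # 116, as quoted in the paper
```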

Matthieu