I vaguely recall from grad school that the following is a valid approach to do a weighted sampling without replacement:
1. Start with an initially empty "sampled set".
2. Draw a (single) weighted sample with replacement with whatever method you have.
3. Check whether you have already picked it.
4. If you have, discard it and move to the next sample. If you haven't, add it to your sampled set.
5. Repeat steps 2-4 until the sampled set is of the desired size.
Will this give me a valid weighted sample?
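For concreteness, the loop described above can be sketched as follows (the function name is mine; this is just the steps written out, not a tuned implementation):

```r
# Rejection-style weighted sampling without replacement, as described above.
# x: population, w: weights, n: desired number of unique draws.
weighted_sample_rejection <- function(x, w, n) {
  sampled <- c()                                      # step 1: empty sampled set
  while (length(sampled) < n) {
    draw <- sample(x, 1, replace = TRUE, prob = w)    # step 2: one weighted draw
    if (!(draw %in% sampled)) {                       # step 3: already picked?
      sampled <- c(sampled, draw)                     # step 4: keep if new
    }
  }
  sampled                                             # step 5: stop at size n
}
```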
(Context: for very large samples, R's `sample(x, n, replace=FALSE, prob=w)` takes a really long time, much longer than implementing the strategy above.)
UPDATE 2012-01-05
Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do (I'm actually calling R from Python via RPy, but the runtime characteristics seem unchanged).
First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:
get_rejection_sample <- function(replace_sample, n) {
  # Binary-search for the shortest prefix of the with-replacement sample
  # that contains n unique values, and return those unique values.
  top <- length(replace_sample)
  bottom <- 1
  mid <- (top + bottom) %/% 2
  u <- unique(replace_sample[1:mid])
  while (length(u) != n) {
    if (length(u) < n) {
      bottom <- mid   # prefix too short: search the upper half
    } else {
      top <- mid      # prefix too long: search the lower half
    }
    mid <- (top + bottom) %/% 2
    u <- unique(replace_sample[1:mid])
    if (mid == top || mid == bottom) {
      break           # interval exhausted; the input may be too short
    }
  }
  u
}
It simply does a binary search for the length of the with-replacement prefix needed to yield `n` unique samples. Now, in the interpreter:
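(The same idea can also be written without the binary search, by just growing the with-replacement batch until it contains enough unique values and keeping the first `n` in draw order. This is only a sketch with illustrative names, not the timed code below:)

```r
# Sketch: draw with-replacement batches, doubling the batch size until
# the batch contains at least n unique values; keep the first n uniques.
# This is equivalent to rejecting repeats as they arrive.
rejection_sample_simple <- function(x, n, w, batch = 2 * n) {
  repeat {
    u <- unique(sample(x, batch, replace = TRUE, prob = w))
    if (length(u) >= n) return(u[1:n])  # first n uniques in draw order
    batch <- 2 * batch                  # too many repeats: enlarge batch
  }
}
```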
> x = 1:10**7
> w = rexp(length(x))
> system.time(y <- sample(x, 500, FALSE, w))
user system elapsed
12.820 0.048 12.902
> system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500))
user system elapsed
0.311 0.066 0.378
Warning message:
In sample(x, 1000, TRUE, w) :
Walker's alias method used: results are different from R < 2.2.0
As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample $10^6$ indices from a list of about $1.2 \times 10^7$, which takes hours using `replace=FALSE`.