I vaguely recall from grad school that the following is a valid approach to do a weighted sampling without replacement:
1. Start with an initially empty "sampled set".
2. Draw a (single) weighted sample with replacement with whatever method you have.
3. Check whether you have already picked it.
4. If you have, discard it and move to the next sample. If you haven't, add it to your sampled set.
5. Repeat steps 2-4 until the sampled set is of the desired size.
Will this give me a valid weighted sample?
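For concreteness, the loop described above can be sketched as follows (the function name is mine; this is just the steps written out, not a tuned implementation):

```r
# Rejection-style weighted sampling without replacement, as described above.
# x: population, w: weights, n: desired number of unique draws.
weighted_sample_rejection <- function(x, w, n) {
  sampled <- c()                                      # step 1: empty sampled set
  while (length(sampled) < n) {
    draw <- sample(x, 1, replace = TRUE, prob = w)    # step 2: one weighted draw
    if (!(draw %in% sampled)) {                       # step 3: already picked?
      sampled <- c(sampled, draw)                     # step 4: keep if new
    }
  }
  sampled                                             # step 5: stop at size n
}
```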
(Context: for very large samples, R's `sample(x, n, replace=FALSE, prob=w)` takes a really long time, much longer than implementing the strategy above.)
UPDATE 2012-01-05
Thank you everyone for your comments and responses! Regarding code, here's an example that faithfully reproduces what I'm trying to do (I'm actually calling R from Python via RPy, but the runtime characteristics seem unchanged).
First, define the following function, which assumes a sufficiently long sequence of with-replacement samples as input:
get_rejection_sample <- function(replace_sample, n) {
  # Binary-search for the shortest prefix of the with-replacement sample
  # that contains n unique values, and return those unique values.
  top <- length(replace_sample)
  bottom <- 1
  mid <- (top + bottom) %/% 2
  u <- unique(replace_sample[1:mid])
  while (length(u) != n) {
    if (length(u) < n) {
      bottom <- mid   # prefix too short: search the upper half
    } else {
      top <- mid      # prefix too long: search the lower half
    }
    mid <- (top + bottom) %/% 2
    u <- unique(replace_sample[1:mid])
    if (mid == top || mid == bottom) {
      break           # interval exhausted; the input may be too short
    }
  }
  u
}
It simply does a binary search for the length of the with-replacement prefix needed to yield `n` unique samples. Now, in the interpreter:
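(The same idea can also be written without the binary search, by just growing the with-replacement batch until it contains enough unique values and keeping the first `n` in draw order. This is only a sketch with illustrative names, not the timed code below:)

```r
# Sketch: draw with-replacement batches, doubling the batch size until
# the batch contains at least n unique values; keep the first n uniques.
# This is equivalent to rejecting repeats as they arrive.
rejection_sample_simple <- function(x, n, w, batch = 2 * n) {
  repeat {
    u <- unique(sample(x, batch, replace = TRUE, prob = w))
    if (length(u) >= n) return(u[1:n])  # first n uniques in draw order
    batch <- 2 * batch                  # too many repeats: enlarge batch
  }
}
```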
> x = 1:10**7
> w = rexp(length(x))
> system.time(y <- sample(x, 500, FALSE, w))
user system elapsed
12.820 0.048 12.902
> system.time(y <- get_rejection_sample(sample(x, 1000, TRUE, w), 500))
user system elapsed
0.311 0.066 0.378
Warning message:
In sample(x, 1000, TRUE, w) :
Walker's alias method used: results are different from R < 2.2.0
As you can see, even for just 500 elements the difference is dramatic. I'm actually trying to sample $10^6$ indices from a list of about $1.2 \times 10^7$, which takes hours using `replace=FALSE`.