It is rare in statistical analysis that such code is truly needed (and it is not needed for the Jackknife), but here (for the record) is a solution.
A reliable method to generate a non-duplicating collection of $k$-subsets of $n$ things is to associate a unique combination with each integer $x$ in the set $\{0, 1, \ldots, \binom{n}{k}-1\}$. This can be done by walking upwards in Pascal's Triangle back to its apex, exploiting the relation $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$, and picking elements at each row where you move left (from $k$ to $k-1$).
Start with the triple $(x, n, k)$ with $0 \le x \lt \binom{n}{k}$ and $0 \le k \le n$. When $x \lt \binom{n-1}{k-1}$, select the first of $n$ elements, decrement $x$ to $$x^\prime = x - \binom{n-1}{k-1},$$ and proceed to select $k-1$ out of $n-1$ elements based on the triple $(x^\prime, n-1, k-1)$. Otherwise do not select the first of $n$ elements, do not change $x$, and proceed to select $k$ out of $n-1$ elements based on the triple $(x, n-1, k)$. In any event, if $n = k$, then pick all $k$ elements.
The proof that this works depends on some easily established invariants:
At the outset, $0 \le x \lt \binom{n}{k}$.
Exactly $k$ elements will be picked.
You can check that the following code satisfies these invariants.
choose.int <- function(x, n, k) {
if(n <= k) return(rep(TRUE, k))
u <- choose(n-1, k-1)
pick <- x < u
if (pick) y <- choose.int(x, n-1, k-1) else y <- choose.int(x-u, n-1, k)
return(c(pick, y))
}
The proof that it generates all such subsets, without duplication, can be accomplished with an inductive argument on $n$.
Thus, to select (say) $1000$ unique $10$-subsets of $45$ things, obtain a random sample without replacement from the set $\{0, 1, \ldots, \binom{45}{10}-1\}$. Using choose.int
, convert each value $x$ in this set into a vector indicating which of the $45$ elements to select:
n <- 45; k <- 10
sample <- sapply(sample.int(choose(n, k), 1000)-1, choose.int, n=n, k=k)
Timing indicates about $10,000$ such subsets can be generated per second (because R
is not efficient with recursive functions).
Because this is a largish array, let's illustrate with smaller values. How about picking $6$ of the $10 = \binom{5}{3}$ three-subsets of five things:
n <- 5; k <- 3
sapply(sample.int(choose(n, k), 6)-1, choose.int, n=n, k=k)
The output will vary with the random seed, but in one case it was
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] TRUE TRUE FALSE TRUE TRUE FALSE
[2,] TRUE FALSE TRUE TRUE FALSE TRUE
[3,] FALSE FALSE TRUE TRUE TRUE TRUE
[4,] TRUE TRUE FALSE FALSE FALSE TRUE
[5,] FALSE TRUE TRUE FALSE TRUE FALSE
Each column indicates which elements to include in the sample. They all have $3$ TRUE
values for the elements picked. No two of the columns are the same.