How to program the deleted d-jack-knife

Question

I have a dataset containing 45 observations. I want to resample from this dataset repeatedly, with a sample size of 35 each time, so each time I want to delete 10 data points from the original dataset. In total there are $\binom{45}{10}$ possible ways to delete $d = 10$ points from the sample. I want the resampling to use all of these possibilities.

Can someone help me to program this? I must add that generating all $\binom{45}{10}$ combinations does not seem to work in R because it exceeds the memory capacity.

Many thanks in advance!

I use the code combn(45,10), which gives the following error: Error: cannot allocate vector of size 23.8 Gb. In addition, the following warning is issued four times: In vector("list", count) : Reached total allocation of 4095Mb: see help(memory.size) — Pieter, Feb 12 '15 at 09:33
But there is probably a way to program the situation I described in another way, without using that code — Pieter, Feb 12 '15 at 09:34
Well, I'll probably just restrict it to 1000 samples. I want to estimate the standard error of an estimate based on the deleted-d-jackknife bootstrap. Concretely, my question is how to program a deleted-d-jackknife bootstrap. Surprisingly, it is easy to program a jackknife bootstrap, but not a deleted-d-jackknife bootstrap — Pieter, Feb 12 '15 at 10:40

score 3 · Answer 1 · answered Aug 21 '15 at 20:56

It is rare in statistical analysis that such code is truly needed (and it is not needed for the Jackknife), but here (for the record) is a solution.

A reliable method to generate a non-duplicating collection of $k$-subsets of $n$ things is to associate a unique combination with each integer $x$ in the set $\{0, 1, \ldots, \binom{n}{k}-1\}$. This can be done by walking upwards in Pascal's Triangle back to its apex, exploiting the relation $\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$, and picking elements at each row where you move left (from $k$ to $k-1$).

Start with the triple $(x, n, k)$ with $0 \le x \lt \binom{n}{k}$ and $0 \le k \le n$. When $x \lt \binom{n-1}{k-1}$, select the first of $n$ elements, do not change $x$, and proceed to select $k-1$ out of the remaining $n-1$ elements based on the triple $(x, n-1, k-1)$. Otherwise do not select the first of $n$ elements, decrement $x$ to $$x^\prime = x - \binom{n-1}{k-1},$$ and proceed to select $k$ out of $n-1$ elements based on the triple $(x^\prime, n-1, k)$. In any event, if $n = k$, then pick all $k$ elements.

The proof that this works depends on some easily established invariants:

At the outset, $0 \le x \lt \binom{n}{k}$.
Exactly $k$ elements will be picked.

You can check that the following code satisfies these invariants.

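# Map an integer x in 0..choose(n,k)-1 to a logical vector of length n with k TRUEs.
choose.int <- function(x, n, k) {
  if (n <= k) return(rep(TRUE, k))   # base case: pick everything that remains
  u <- choose(n-1, k-1)              # number of k-subsets containing the first element
  pick <- x < u                      # the first u values of x select the first element
  if (pick) y <- choose.int(x, n-1, k-1) else y <- choose.int(x-u, n-1, k)
  return(c(pick, y))
}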

The proof that it generates all such subsets, without duplication, can be accomplished with an inductive argument on $n$.
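As an informal check of this claim (not a proof), the following sketch enumerates every $x$ for a small case, $n = 5$ and $k = 3$, and confirms that each column has exactly $k$ TRUE values and that no two columns coincide. It repeats the definition of choose.int so that it runs on its own; the names all_subsets and keys are introduced here purely for illustration.

```r
# Definition repeated from the answer so this block is self-contained.
choose.int <- function(x, n, k) {
  if (n <= k) return(rep(TRUE, k))
  u <- choose(n - 1, k - 1)
  pick <- x < u
  if (pick) y <- choose.int(x, n - 1, k - 1) else y <- choose.int(x - u, n - 1, k)
  c(pick, y)
}

n <- 5; k <- 3

# One column per integer x = 0, 1, ..., choose(n, k) - 1.
all_subsets <- sapply(seq_len(choose(n, k)) - 1, choose.int, n = n, k = k)

# Encode each column as a string of selected positions to test for duplicates.
keys <- apply(all_subsets, 2, function(col) paste(which(col), collapse = "-"))

stopifnot(
  ncol(all_subsets) == choose(n, k),  # one subset per integer
  all(colSums(all_subsets) == k),     # exactly k elements picked each time
  !any(duplicated(keys))              # no subset appears twice
)
```

A check like this only covers the values of $n$ and $k$ it is run with; the inductive argument is what covers the general case.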

Thus, to select (say) $1000$ unique $10$-subsets of $45$ things, obtain a random sample without replacement from the set $\{0, 1, \ldots, \binom{45}{10}-1\}$. Using choose.int, convert each value $x$ in this set into a vector indicating which of the $45$ elements to select:

n <- 45; k <- 10
sample <- sapply(sample.int(choose(n, k), 1000)-1, choose.int, n=n, k=k)

Timing indicates about $10,000$ such subsets can be generated per second (because R is not efficient with recursive functions).

Because this is a largish array, let's illustrate with smaller values. How about picking $6$ of the $10 = \binom{5}{3}$ three-subsets of five things:

n <- 5; k <- 3
sapply(sample.int(choose(n, k), 6)-1, choose.int, n=n, k=k)

The output will vary with the random seed, but in one case it was

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
[2,]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
[3,] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
[4,]  TRUE  TRUE FALSE FALSE FALSE  TRUE
[5,] FALSE  TRUE  TRUE FALSE  TRUE FALSE

Each column indicates which elements to include in the sample. They all have $3$ TRUE values for the elements picked. No two of the columns are the same.
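To connect this back to the question, here is a minimal sketch of how the sampled subsets might be used to estimate a delete-$d$ jackknife standard error. Several things here are assumptions, not part of the answer above: the data x are simulated placeholders, the statistic theta is taken to be the mean, and the scaling factor $(n-d)/(d\,m)$ is the usual delete-$d$ jackknife factor with the full count $\binom{n}{d}$ replaced by the number $m$ of randomly sampled subsets. Please verify the formula against a reference before relying on it.

```r
# Definition repeated from the answer so this block is self-contained.
choose.int <- function(x, n, k) {
  if (n <= k) return(rep(TRUE, k))
  u <- choose(n - 1, k - 1)
  pick <- x < u
  if (pick) y <- choose.int(x, n - 1, k - 1) else y <- choose.int(x - u, n - 1, k)
  c(pick, y)
}

set.seed(1)
x <- rnorm(45)        # placeholder data (assumption): 45 observations
theta <- mean         # placeholder statistic (assumption)
n <- length(x)
d <- 10               # number of points deleted each time
m <- 1000             # number of random subsets used instead of all choose(n, d)

# Draw m distinct integers and convert each into a deletion pattern.
ids <- sample.int(choose(n, d), m) - 1
deleted <- sapply(ids, choose.int, n = n, k = d)  # n x m logical; TRUE = deleted

# Recompute the statistic on the n - d retained points of each subset.
theta_s <- apply(deleted, 2, function(del) theta(x[!del]))

# Assumed delete-d jackknife standard error, using the m sampled subsets.
se_jack_d <- sqrt((n - d) / (d * m) * sum((theta_s - mean(theta_s))^2))
se_jack_d
```

For the mean, the result should be close to sd(x) / sqrt(n), which gives a quick sanity check on the scaling.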

Tim · Answer 2 · 2015-07-22T08:20:13.550

The delete-d jackknife is not an efficient method for this kind of case. As you noticed, the number of possible combinations is $\binom{n}{d}$, which in your case is 3190187286. The function combn returns an error because you ask it to generate a huge matrix of all possible combinations, and that matrix does not fit in memory. You could try workarounds such as generating one combination at a time, computing a summary statistic for it, and not storing the intermediate values. Notice, however, that you would still have to loop through all the combinations, which would take a very long time. This is certainly not efficient. Why not try the bootstrap instead? The bootstrap generally needs far fewer iterations.
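A minimal sketch of the bootstrap alternative, assuming simulated placeholder data x, the mean as the statistic theta, and B = 1000 resamples (all three are illustrative choices, not taken from the question):

```r
set.seed(1)
x <- rnorm(45)      # placeholder data (assumption): 45 observations
theta <- mean       # placeholder statistic (assumption)
B <- 1000           # number of bootstrap resamples (assumption)

# Resample the data with replacement B times and recompute the statistic.
theta_star <- replicate(B, theta(sample(x, replace = TRUE)))

# The bootstrap standard error is the standard deviation of the replicates.
se_boot <- sd(theta_star)
se_boot
```

Each resample has the same size as the original data, so no combinations need to be enumerated or stored.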
