
I'm interested in sampling a large dataset in nonconsecutive-record sequences of arbitrary length without overlap. I know from

How to take many samples of 10 from a large list, without replacement overall

how to do it if the sequence length is fixed (at 10 in that example), via

sample <- split(sample(datapoints), rep(1:(length(datapoints)/10+1), each=10))

How do I generalize this to produce samples whose sizes are uniformly distributed, not just sequences of 10?

Furthermore, can I specify that I want N such sequences?


For example, suppose I have the sequence d <- 1:20 and permute it via sample(d, 20, replace = FALSE) to obtain a permuted sequence. Now I want to extract arbitrarily sized subsequences of this permuted sequence, say d1 <- c(1,5,4,15,3), d2 <- c(18,7,12,11,19,16,10,8,14,17,20,13), d3 <- c(9,2,6), quickly, as in the example above. My dataset is large, and I'd like to simply sample it once and then split it, without the fixed-length constraint of the post I cite.

  • Are you familiar with the binomial distribution? You can randomly draw from a binomial to decide sample length and which chunks to draw. – Zachary Blumenfeld Dec 06 '15 at 00:34
  • Not sure I understand. The sampling has to be exhaustive, i.e. include each datapoint exactly once. – user1809593 Dec 06 '15 at 00:37
  • You need some process to decide your sample length and which data to sample. If you would like these things not to be fixed but to vary randomly, you must draw them from a discrete probability distribution. The binomial distribution is a good place to start. – Zachary Blumenfeld Dec 06 '15 at 00:45
  • If you sample without replacement you just redefine the binomial to be over the leftover data after each draw. – Zachary Blumenfeld Dec 06 '15 at 00:48
  • When you say chunks of arbitrary length, do you want these chunks to contain consecutive records from your data set, or each to be chosen at random from the entire data set? ("Chunks of length..." sounds rather like you mean a consecutive run, though I don't think this is what you intend, judging from the rest of the question talking about "sampling".) – Silverfish Dec 06 '15 at 00:52
  • Nonconsecutive records, you're right. Sorry for the ambiguity. – user1809593 Dec 06 '15 at 00:53
  • Zachary: I'll try, but I'm trying to avoid the implicit `for` loop – user1809593 Dec 06 '15 at 00:54
  • Basically, how do you randomly subset a single reshuffling of indices in one go? – user1809593 Dec 06 '15 at 01:09
  • Arbitrarily sized as in e.g. 25 instead of 10? – tho_mi Dec 06 '15 at 00:34
  • A uniform distribution of sizes. – user1809593 Dec 06 '15 at 00:35
  • n=length(data); nn=sample(1:n,n,replace=FALSE); rand.start=sample(1:(n-1),1); rand.end=sample((rand.start+1):n,1); rand.samp=data[nn[rand.start:rand.end]]; # is that what you mean? – Zachary Blumenfeld Dec 06 '15 at 04:00
  • It's close. Suppose I wrap this in a while loop on `n` and build a list `rand.samp` that contains all the nonoverlapping, unequal-length subsequences. Then you can verify that `hist(unlist(rand.samp))` is heavily skewed to short sequences. How can I make the distribution of sequence lengths uniform? – user1809593 Dec 06 '15 at 05:54
  • After reading several times through the question and comments I still have no idea what you are asking. Because your usage of terms appears to be inconsistent with their standard meanings, I don't know what you mean by "sample"--it sounds like it could be a permutation, or maybe not--nor by "nonconsecutive-record", nor even by "sequence" or "arbitrary length"! Could you perhaps illustrate the input and desired output for a small example? – whuber Dec 06 '15 at 15:08
  • Any ambiguity is my fault. Let me try again: Ok, suppose I have the sequence d – user1809593 Dec 06 '15 at 15:33

1 Answer


Here's a simple answer based on Zachary's comment. However, it isn't efficient for large datasets (large length(data1)), and it doesn't simply permute the dataset once via sample and then split it into arbitrarily sized samples, which was the elegant logic of the post I cited above.

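# data1 <- 1:14e6
# data <- sample(data1, length(data1), replace = FALSE)  # not necessary
data <- 1:14e6
K <- 0
rand.samp <- list()  # list() rather than NULL, so [[K]] always stores a whole vector
while (length(data) > 0) {
  K <- K + 1
  n <- length(data)
  if (n == 1) {  # a single leftover record becomes the last sample
    rand.samp[[K]] <- data
    break
  }
  nn <- sample.int(n)                                     # permute the remaining indices
  rand.start <- sample.int(n - 1, 1)                      # start position in 1..(n-1)
  rand.end <- rand.start + sample.int(n - rand.start, 1)  # end position in (rand.start+1)..n
  rand.samp[[K]] <- data[nn[rand.start:rand.end]]
  data <- setdiff(data, rand.samp[[K]])                   # drop the records just drawn
  print(length(data))                                     # records still unassigned
}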

desired output is this list

rand.samp

Histogram of the random sequence lengths (log10 scale)

hist(log10(lengths(rand.samp)), breaks = 100)
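
For comparison, the "permute once, then split" idea from the question could be sketched as follows. This is only a sketch under stated assumptions: `random_split` is a hypothetical helper name, not something from the cited post, and it places the N - 1 break points uniformly at random along the shuffled data. The resulting chunk sizes are random and all follow the same distribution, but because they must sum to the data length they cannot be independent uniform draws.

```r
# Hypothetical helper (not from the cited post): shuffle the data once,
# then cut the permutation at random break points into N chunks.
random_split <- function(x, N) {
  n <- length(x)
  stopifnot(N >= 1, N <= n)
  shuffled <- x[sample.int(n)]            # a single permutation of the data
  cuts <- sort(sample.int(n - 1, N - 1))  # N - 1 distinct break points in 1..(n-1)
  sizes <- diff(c(0, cuts, n))            # chunk sizes: each >= 1, summing to n
  # Label each shuffled element with its chunk number and split on that label
  unname(split(shuffled, rep.int(seq_len(N), sizes)))
}

set.seed(1)
chunks <- random_split(1:20, 3)
lengths(chunks)       # three random sizes that sum to 20
sort(unlist(chunks))  # every value of 1:20 appears exactly once
```

There is no explicit loop here: the shuffle, the break points, and the split are each a single vectorized call, so the cost should stay close to that of one `sample` plus one `split`, even for data the size of 1:14e6.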