
I am stuck with a very simple question, but I don't really understand sampling, so please help me.

Assume that I perform Bernoulli sampling with parameter $q$ on data D and obtain a sample S1. Then, on S1, I perform another Bernoulli sampling with parameter $q'$ and obtain a sample S2. Is the resulting sample S2 a random sample of the original data D?

kjetil b halvorsen
  • In an [answer to a related question](http://stats.stackexchange.com/questions/50/what-is-meant-by-a-random-variable/54894#54894) about random variables I describe a standard model for sampling that might help you reason through this question. – whuber Dec 03 '13 at 16:18
  • @whuber: Thank you very much for the model. You said that for a random variable we can say that "X(ω) will lie between such-and-such (a) and such-and-such (b)". I am not sure if this holds in the problem I described? – Long Vehicle Dec 03 '13 at 17:07
  • What you should focus on is the drawing-tickets-from-a-box metaphor. In its terms, your question is this: I drew a bunch of tickets from a box according to a certain procedure. Then I randomly chose among those I just drew. Is this the same as if I had originally chosen a smaller number of tickets? The answer might depend on the procedures used to draw tickets, but in some cases it should be clear. To get started, you ought to think about the following: for any given ticket in the box, what is its chance of winding up in the ultimate (second) sample? – whuber Dec 03 '13 at 17:51

2 Answers


A rigorous and conceptually simple way to assess sampling procedures is to compute the chance that any particular subset of the population could be the sample.

In Bernoulli sampling, independent Bernoulli random variables $X_i$ (with stipulated probabilities, usually all equal) are associated with the population members $i$. The sample consists of all members $i$ for which $X_i=1$. The question concerns a situation where this procedure is repeated, with a new independent set of random variables $Y_j$, on the sample that is obtained.

At this point we can use a little trick to clarify the situation: in addition to assigning random variables $Y_j$ to the members of the first sample, also assign random variables $Y_{j'}$ to all other members $j'$ of the population. Because we never observe these random variables, including them in this conceptual setup makes no difference.

Thus, the sampling model is this: associated with each population member $i$ are two independent Bernoulli variables, $X_i$ with parameter $q$ and $Y_i$ with parameter $q'$. The chance that $i$ appears in the final sample is the chance that both $X_i=1$ and $Y_i=1.$ Because these two variables are independent, their probabilities multiply, whence the chance that $i$ is in the sample equals $qq'$. The chance that $j\ne i$ is also in the sample is, by construction, independent of that outcome, so that's all the computation we need to do: the probability that the sample equals any given subset is a product with one factor per population member, $qq'$ for each member of the subset and $1-qq'$ for each member outside it.
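Written out, the computation in the preceding paragraph looks as follows. The symbols $N$ (the population size), $A$ (an arbitrary subset of the population), and $S$ (the final sample) are notation introduced here for illustration; they do not appear in the original argument.

$$\Pr(i \in S) = \Pr(X_i = 1,\ Y_i = 1) = \Pr(X_i = 1)\,\Pr(Y_i = 1) = qq',$$

$$\Pr(S = A) = \prod_{i \in A} qq' \prod_{i \notin A} (1 - qq') = (qq')^{|A|}\,(1 - qq')^{N - |A|}.$$

The second line is exactly the probability that ordinary one-stage Bernoulli sampling with parameter $qq'$ assigns to the subset $A$.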

In effect, then, the two-stage procedure determines whether $i$ is in the sample by evaluating the product $X_iY_i$. This random variable has the chance $qq'$ of equaling $1$ and the chance $1-qq'$ of equaling $0$: that is precisely what it means to be a Bernoulli variable. We have thereby seen that the two-stage procedure is Bernoulli sampling with probability $qq'$.
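A quick simulation can serve as a sanity check on this conclusion. The following is only a minimal sketch using Python's standard `random` module; the function names, the population of 20 integers, and the values $q=0.5$, $q'=0.4$ are all assumptions chosen for illustration. It estimates how often each member ends up in the final sample, and how often two particular members end up there together.

```python
import random


def bernoulli_sample(items, p, rng):
    """Keep each item independently with probability p."""
    return [x for x in items if rng.random() < p]


def two_stage_sample(items, q, q_prime, rng):
    """Bernoulli(q) sample of items, then a Bernoulli(q') sample of that sample."""
    first_stage = bernoulli_sample(items, q, rng)
    return bernoulli_sample(first_stage, q_prime, rng)


def inclusion_frequencies(population, q, q_prime, trials, seed=0):
    """Estimate per-member inclusion rates and one pairwise joint rate."""
    rng = random.Random(seed)
    counts = {x: 0 for x in population}
    first, second = population[0], population[1]
    pair_count = 0  # trials in which the first two members are both selected
    for _ in range(trials):
        final = set(two_stage_sample(population, q, q_prime, rng))
        for x in final:
            counts[x] += 1
        if first in final and second in final:
            pair_count += 1
    freqs = {x: c / trials for x, c in counts.items()}
    return freqs, pair_count / trials


if __name__ == "__main__":
    q, q_prime = 0.5, 0.4  # illustrative values only
    freqs, pair_freq = inclusion_frequencies(list(range(20)), q, q_prime, trials=20000)
    print("target inclusion rate q*q':", q * q_prime)
    print("min/max estimated rate:", min(freqs.values()), max(freqs.values()))
    print("target joint rate (q*q')^2:", (q * q_prime) ** 2)
    print("estimated joint rate:", pair_freq)
```

If the argument above is right, the estimated inclusion rates should all sit near $qq'$ and the joint rate near $(qq')^2$, which is what independence across members predicts. A simulation like this only illustrates the result; the proof is the probability calculation itself.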

whuber

A nice overview of sampling techniques for this purpose can be found in
[Non-Uniformity Issues and Workarounds in Bounded-Size Sampling, VLDB 2013] and
[Maintaining bounded-size sample synopses of evolving datasets, VLDB 2007].

The answer can be found in Section 2.3 of [Non-Uniformity Issues and Workarounds in Bounded-Size Sampling, VLDB 2013]. It is a uniform sample, as long as we perform the subsampling based on the original data size rather than on the sample size. In this case, the sample size is random rather than fixed.