Calculating % unsampled in sampling with replacement

Question

You draw a sample of size N from a population of N items, with replacement.

How do you calculate the expected percentage of the original population that is never sampled?

Extra Credit: Generalize to sampling k of N items with replacement.

The answer to the initial question is about $36.8\%$. I didn't calculate this analytically but by simulation in R. `N=1000000; (N-length(unique(sample(c(1:N),N,replace=TRUE))))/N`. — Macro, Sep 09 '11 at 04:21

score 6 · Answer 1 · answered Sep 09 '11 at 03:45

If $N$ is large, the number of times each item is sampled is approximately Poisson distributed with mean $k/N$. The probability that a Poisson variable with mean $\lambda$ equals zero is $e^{-\lambda}$, so you can estimate the unsampled proportion as $e^{-k/N}$. An exact solution for small $N$ is a bit of a chore to derive.
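As a rough check of the Poisson claim, here is a minimal Python sketch (not part of the original answer). It tallies how many items are drawn exactly 0, 1, 2, ... times and prints those fractions next to the Poisson probabilities with mean $k/N$. The function names `per_item_count_distribution` and `poisson_pmf` are made up for illustration, and the seed and sample sizes are arbitrary choices.

```python
import math
import random
from collections import Counter


def per_item_count_distribution(n, k, seed=0):
    """Draw k times from n items with replacement.

    Returns a dict mapping m to the fraction of items drawn exactly m times.
    """
    rng = random.Random(seed)
    draws = Counter(rng.randrange(n) for _ in range(k))  # item -> times drawn
    times_drawn = Counter(draws.values())  # m -> number of items drawn m times (m >= 1)
    times_drawn[0] = n - len(draws)        # items that were never drawn
    return {m: c / n for m, c in sorted(times_drawn.items())}


def poisson_pmf(m, lam):
    """Poisson probability of observing the count m when the mean is lam."""
    return math.exp(-lam) * lam ** m / math.factorial(m)


if __name__ == "__main__":
    n = k = 100_000  # illustrative sizes only
    empirical = per_item_count_distribution(n, k)
    for m in range(5):
        print(m, round(empirical.get(m, 0.0), 4), round(poisson_pmf(m, k / n), 4))
```

The entry for `m = 0` is the unsampled proportion, so for large $N$ it should land close to $e^{-k/N}$.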

Mike Anderson

Macro · Accepted Answer · 2011-09-09T15:53:32.687

Let $Z_{ij}$ be the binary indicator that subject $i = 1, ..., N$ was selected as the $j$th sampled unit, $j = 1, ..., k$. Since the sampling (assumed to be simple random sampling) is with replacement, for any fixed subject $i$ the indicators $Z_{i1}, ..., Z_{ik}$ are independent Bernoulli trials, each with success probability $1/N$. Therefore the number of times subject $i$ was sampled,

$$ Y_{i} = \sum_{j=1}^{k} Z_{ij}, $$

has a ${\rm Binomial}(k,1/N)$ distribution. So, the probability that a particular unit is not sampled, $P(Y_{i} = 0)$, is calculated from the binomial mass function as

$$ P(Y_{i} = 0) = (1 - 1/N)^{k} $$
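As a quick sanity check on a case small enough to enumerate (an illustrative example, not part of the original derivation), take $N = 2$ and $k = 2$. The four equally likely draw sequences are $(1,1)$, $(1,2)$, $(2,1)$ and $(2,2)$, and subject 1 is missing only from $(2,2)$, which agrees with the formula:

$$ P(Y_{1} = 0) = (1 - 1/2)^{2} = \frac{1}{4} $$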

So, the indicator of subject $i$ not being sampled, $X_{i} = \mathcal{I}(Y_{i} = 0)$, is a Bernoulli trial with success probability $(1 - 1/N)^{k}$. Note that the $Y_{i}$'s are not independent of each other since, for example, if $Y_{1} = k$, then you know $Y_{2}, ..., Y_{N}$ are all 0. The $X_{i}$'s are not independent of each other either: for $k \geq 1$ they cannot all equal 1, because at least one subject must be sampled. Linearity of expectation does not require independence, however, so the expected proportion of the population that is not sampled is

$$ \mu_{k} = E \left( \frac{1}{N} \sum_{i=1}^{N} X_{i}\right) = \frac{1}{N} \sum_{i=1}^{N} E(X_{i}) = \frac{1}{N} \cdot N \cdot (1 - 1/N)^{k} = (1 - 1/N)^{k} $$
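To see the formula for $\mu_{k}$ hold up numerically even though the $X_{i}$ are dependent, here is a small Python sketch that compares $(1 - 1/N)^{k}$ with a Monte Carlo average. It is only a sketch: the function names, trial count and seed are arbitrary choices made for illustration and are not from the original answer.

```python
import random


def expected_unsampled(n, k):
    """Exact expected fraction of n items never drawn in k draws with replacement."""
    return (1 - 1 / n) ** k


def simulate_unsampled(n, k, trials=10_000, seed=0):
    """Average, over repeated trials, the fraction of items that were never drawn."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        seen = {rng.randrange(n) for _ in range(k)}  # distinct items drawn in this trial
        total += (n - len(seen)) / n
    return total / trials


if __name__ == "__main__":
    # A few (N, k) pairs, including k < N, k = N and k > N.
    for n, k in [(10, 5), (10, 10), (10, 30)]:
        print(n, k, expected_unsampled(n, k), simulate_unsampled(n, k))
```

The simulated averages should sit close to the exact values for every pair, including the small-$N$ cases where the Poisson approximation is less accurate.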

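Edit: As Mike Anderson points out in his answer, this quantity is well approximated by $e^{-k/N}$ when $N$ is large. For $k = N$ this gives $e^{-1} \approx 0.368$, consistent with the simulated value of about $36.8\%$. This is an example of the Poisson approximation to the binomial: http://en.wikipedia.org/wiki/Binomial_distribution#Poisson_approximation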
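To get a feel for how good the approximation is, the following Python sketch tabulates the exact value $(1 - 1/N)^{k}$ next to $e^{-k/N}$ for $k = N$. The function names are hypothetical, and the exact value is computed through `math.log1p` purely as a numerical-accuracy choice (it assumes $N \geq 2$). Because $\log(1 - x) < -x$ for $0 < x < 1$, the approximation always slightly overestimates the exact value.

```python
import math


def exact_unsampled(n, k):
    """(1 - 1/n)**k, computed via log1p for accuracy when n is large (requires n >= 2)."""
    return math.exp(k * math.log1p(-1.0 / n))


def poisson_approx(n, k):
    """Poisson approximation exp(-k/n) to the unsampled proportion."""
    return math.exp(-k / n)


if __name__ == "__main__":
    for n in (5, 50, 500, 5000):
        k = n  # the original question's case; any k >= 0 works
        exact, approx = exact_unsampled(n, k), poisson_approx(n, k)
        print(n, exact, approx, approx - exact)
```

The gap in the last column shrinks as $N$ grows, which is why the approximation is only recommended for large $N$.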
