
I don't know a lot about sampling methods.

I have a large population of size 2,000,000. I used one of those sample size calculators. It says that I need a sample size of approximately 10,000.

I am trying to find the probability p of success for the population. It is not feasible for me to test all 2,000,000 members of the population. That is why I am sampling.

I assume that the sample size calculator means a simple random sample without replacement. I have read that a simple random sample with replacement ensures that the covariance between two variables is 0, i.e., independent.

When should one choose with replacement instead of without replacement?

If we sample with replacement, then we are simply performing Bernoulli trials. I suppose this makes applying statistical tools (which?) easier.

Again, sampling ignoramus here.

Martin Velez
  • Zero covariance does not imply independence, but independence implies zero covariance. More detail is [here](http://stats.stackexchange.com/questions/12842/covariance-and-independence) – Colin T Bowers Feb 07 '13 at 08:11
  • Also, you appear to have asked the [same question twice](http://stats.stackexchange.com/questions/49481/sampling-with-or-without-replacement) within the last 20 minutes. It would be useful if you deleted one of them so that answers don't get distributed across two questions. Cheers. – Colin T Bowers Feb 07 '13 at 08:18
  • @ColinTBowers If you mean the question I think you're referring to, while his title was (confusingly) the same, the content of the question looks different. – Glen_b Feb 07 '13 at 08:23
  • I deleted the other question. I wasn't sure if this one had posted successfully. Also, I updated the question. – Martin Velez Feb 07 '13 at 08:26
  • if you can sample multiple times, you might want to see: http://en.wikipedia.org/wiki/Mark_and_recapture – R J Feb 07 '13 at 08:29
  • Where did you read that random sampling ensures a 0 covariance between variables? This is clearly incorrect: e.g. a random sample of people's heights and weights will certainly not have 0 covariance between them. – Peter Flom Feb 07 '13 at 11:28
  • It would be helpful to see your calculations. The sample of 10,000 gives you the margin of sampling error of about 1%. Sometimes you want a better accuracy; the "default" sample size in social science research is about 1,000 that gives a margin of error of about 3%. – StasK Feb 08 '13 at 14:41
  • Peter, there may be some mix-up of the terminology here. From the point of view of finite population sampling, the only random variables are sample inclusion indicators: $I_i=1$ if unit $i$ is in the sample, $I_i=0$ if not. Height and weight are fixed characteristics of the units, and there's no randomness in them: your weight does not jump around by 20 lbs from one day to the next, does it? Sampling with replacement ensures ${\rm Cov}(I_i,I_j)=0$ for $i \neq j$; sampling without replacement produces ${\rm Cov}(I_i,I_j)<0$. That's advanced material though, and only comes up in specialized classes. – StasK Feb 08 '13 at 14:44
  • I read in [Handbook of Probability](http://www.amazon.com/Handbook-Probability-Applications-Tamas-Rudas/dp/1412927145) that if the population is infinite, then sampling is usually done without replacement. When the population is finite, sampling is done with replacement. – Martin Velez Feb 14 '13 at 05:58
  • Handbooks are certainly the first stopping point, but life tends to be richer than the handbooks are :). – StasK Feb 14 '13 at 21:01

1 Answer


From a finite population perspective, the difference between the variances of the sample means or totals obtained via simple random sampling with replacement (SRSWR) and without replacement (SRSWOR) is captured by the finite population correction (FPC): $$ \mathbb{V}_{\rm SRSWOR}[\bar y] = \Bigl( 1 - \frac{n}{N}\Bigr) \mathbb{V}_{\rm SRSWR}[\bar y] $$ where $n$ is the sample size, $N$ is the population size, and the FPC is the factor in parentheses. For your problem, the FPC = 1 - 10,000/2,000,000 = 1 - 1/200 = 0.995, and frankly I would not bother chasing that factor down; I would treat it as equal to 1. I typically tell my students to start keeping track of the FPC when the sampling fraction $n/N \ge 0.1$.
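To make the size of that factor concrete, here is a minimal Python sketch of the arithmetic. It assumes the with-replacement variance of a sample proportion is $p(1-p)/n$ and takes $p = 0.5$ as a worst-case guess; the function names `fpc` and `se_proportion` are made up for this illustration, and the small $N/(N-1)$ factor is ignored.

```python
import math

def fpc(n, N):
    """Finite population correction (1 - n/N), applied to the variance."""
    return 1.0 - n / N

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion.

    N=None is treated as sampling with replacement (no FPC);
    otherwise the with-replacement variance is multiplied by the FPC.
    """
    var_wr = p * (1.0 - p) / n
    if N is None:
        return math.sqrt(var_wr)
    return math.sqrt(fpc(n, N) * var_wr)

n, N = 10_000, 2_000_000
p = 0.5  # assumed worst case; the true p is unknown before sampling

se_wr = se_proportion(p, n)      # with replacement
se_wor = se_proportion(p, n, N)  # without replacement

print("FPC:", fpc(n, N))
print("SE with replacement:   ", se_wr)
print("SE without replacement:", se_wor)
print("approx. 95% margin (with replacement):", 1.96 * se_wr)
```

Under these assumptions the two standard errors differ only by a factor of $\sqrt{0.995}$, which is why the correction is hard to notice at this sampling fraction.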

Sometimes, the decision between SRSWOR and SRSWR is a matter of logistics, i.e., it depends on how easy it is to organize one or the other. A simple method to draw an SRSWOR is to assign a random number $U_i \sim \mbox{i.i.d. } U[0,1]$ to every record $i=1,\ldots,N$, sort by $U_i$, and take the first $n$ entries. A simple method to draw an SRSWR is to produce $n$ random numbers $V_j \sim \mbox{i.i.d. } U[0,1]$ and take the units with indices $\{ [N V_j+1], j=1, \ldots, n \}$ (the brackets stand for the integer part). Depending on how your population (referred to as the frame in sampling terminology) is organized, one may be easier than the other, or neither may be feasible at all.
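The two drawing schemes can be sketched in Python as follows. This is only an illustration under the assumption that the frame is a list of records numbered $1,\ldots,N$; the function names `draw_srswor` and `draw_srswr`, the seed, and the small values of `N` and `n` are all hypothetical choices.

```python
import random

def draw_srswor(N, n, rng):
    """SRSWOR: attach U_i ~ U[0,1) to every unit, sort by U_i, keep the first n."""
    keyed = [(rng.random(), i) for i in range(1, N + 1)]
    keyed.sort()  # sorts by the random key U_i
    return [i for _, i in keyed[:n]]

def draw_srswr(N, n, rng):
    """SRSWR: draw n uniforms V_j and take the unit with index floor(N * V_j) + 1."""
    # min(..., N) is a defensive guard against floating-point rounding at the top end
    return [min(int(N * rng.random()) + 1, N) for _ in range(n)]

rng = random.Random(12345)  # fixed seed, only so the sketch is reproducible
N, n = 1000, 50             # small illustrative sizes, not the 2,000,000 / 10,000 case

wor = draw_srswor(N, n, rng)
wr = draw_srswr(N, n, rng)

print("without replacement, distinct units:", len(set(wor)))
print("with replacement, distinct units:   ", len(set(wr)))
```

Without replacement, every selected index is distinct by construction; with replacement, repeats are possible, although they are rare when $n/N$ is small. In practice the standard library's `random.sample` and `random.choices` draw the same two kinds of sample without the explicit sort.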

The standard sampling reference I give is Lohr (2009).

StasK
  • @MartinVelez The FPC is only used when the sample is large relative to the population. This never applies to an infinite population. When the population is finite, sampling can be done with or without replacement. The estimator will be similar, but the variance will be very different. – SmallChess Aug 18 '15 at 12:57
  • Many thanks, but why must we *sort* by $U_i$ before taking the first $n$ entries? – Simon Harmel Aug 09 '20 at 19:29
  • @SimonHarmel This gives you $n$ random elements, and their probabilities of selection are equal. – StasK Oct 07 '20 at 20:56