
When taking a simple random sample to estimate the population mean of some variable, how do I know whether sampling happens with or without replacement?

It feels wrong to use replacement, because 1) my AP stats teacher never does that and 2) I might use someone's data twice in the average.

But on the other hand, the proof that the sample mean is an unbiased estimator of the population mean is, writing $X=X_1+\cdots+X_n$, $$E(X)=E(X_1)+\cdots+E(X_n)=\mu+\cdots+\mu=n\mu$$ which implies $$E\left(\frac{X}{n}\right)=\frac{n\mu}{n}=\mu$$ But doesn't this assume that the $n$ observations $X_i$ are independent of each other? And isn't that only true if we replace after each draw?

user45031

1 Answer


Linearity of expectation doesn't rely on independence.

Only the variance is affected. Sampling without replacement (as most, but not all, population sampling is done) reduces the variance of the sample mean a little, at least in the common situation where the sample is much smaller than the population; hence the rule of thumb about ignoring the effect when the sample is a sufficiently small fraction of the population.

For simple random sampling without replacement, it's actually quite easy to work out the mean and variance from fairly simple reasoning.
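One way that reasoning can go, as a sketch with notation assumed here rather than taken from the answer: let the population values be $y_1,\dots,y_N$ with mean $\mu$ and variance $\sigma^2=\frac{1}{N}\sum_{k=1}^{N}(y_k-\mu)^2$, and let $X_1,\dots,X_n$ be the draws without replacement. By symmetry, every draw is marginally uniform over the population, so $$E(X_i)=\mu,\qquad \operatorname{Var}(X_i)=\sigma^2\quad\text{for every } i.$$ The draws are dependent, though. Taking $n=N$ gives a sum that is constant, so $0=N\sigma^2+N(N-1)\operatorname{Cov}(X_i,X_j)$, and therefore $$\operatorname{Cov}(X_i,X_j)=-\frac{\sigma^2}{N-1},\qquad i\neq j.$$ Putting these together for the sample mean $\bar X=\frac{1}{n}(X_1+\cdots+X_n)$: $$E(\bar X)=\mu,\qquad \operatorname{Var}(\bar X)=\frac{1}{n^2}\left(n\sigma^2-\frac{n(n-1)\sigma^2}{N-1}\right)=\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$ With replacement the covariance term vanishes and the variance is $\sigma^2/n$, so the factor $\frac{N-n}{N-1}$ is the reduction from sampling without replacement; it is close to 1 when $N$ is much larger than $n$.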

Formulas for estimating means and proportions under sampling without replacement are readily found - for example, here.
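The claims above can be checked with a small simulation. The following is a minimal sketch, not a definitive implementation: the population (the integers 1 to 20), the sample size of 5, and the helper name `simulate_sample_means` are all assumptions chosen for illustration. It compares the simulated mean and variance of the sample mean against the standard textbook values, $\sigma^2/n$ with replacement and $\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}$ without.

```python
import random
import statistics


def simulate_sample_means(population, n, replace, reps, seed=0):
    """Return `reps` sample means from simple random samples of size n.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        if replace:
            sample = rng.choices(population, k=n)  # with replacement
        else:
            sample = rng.sample(population, n)     # without replacement
        means.append(statistics.fmean(sample))
    return means


# Assumed toy population; any finite list of numbers would do.
population = list(range(1, 21))
N, n = len(population), 5
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)  # population variance (divides by N)

# Theoretical variance of the sample mean under each scheme.
theory_wr = sigma2 / n
theory_wor = sigma2 / n * (N - n) / (N - 1)

results = {}
for label, replace, theory in (
    ("with replacement", True, theory_wr),
    ("without replacement", False, theory_wor),
):
    means = simulate_sample_means(population, n, replace, reps=20000)
    results[label] = (statistics.fmean(means), statistics.pvariance(means))
    print(f"{label}: mean of sample means = {results[label][0]:.3f} "
          f"(population mean {mu}), variance = {results[label][1]:.3f} "
          f"(theory {theory:.3f})")
```

In both cases the average of the sample means should land close to the population mean, while the simulated variance should be noticeably smaller without replacement, because here the sample is a quarter of the population.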

Glen_b
  • Thanks! Is there ever a reason to sample without replacement, if I could sample with? What is the point of using sampling without replacement? – user45031 May 05 '14 at 23:47
  • In some situations it's at least inconvenient to avoid the small risk of sampling the same unit twice. In most sampling, without replacement is quite natural, but under some conditions it's not. For example, depending on how the technology is set up, [random digit dialing](http://en.wikipedia.org/wiki/Random_digit_dialing) might sample with replacement, but the minuscule chance of dialing the same number twice (given the sample size relative to the population being sampled) may well make it not worth worrying about (though one might guard against it by simply asking first if they'd already been surveyed). – Glen_b May 05 '14 at 23:52
  • What's wrong with using the same data twice though? As long as the random numbers told you to do so, you haven't done anything wrong right? Isn't "avoiding the risk" introducing human bias into the analysis? – user45031 May 06 '14 at 00:26
  • I didn't say anything was *wrong*. You just asked why you wouldn't *always* sample without replacement. I gave an example where you might not; the cost is that you might have slightly higher variance. There's no "human bias" in either sampling with or without replacement; it's a choice you can make a priori, and if your technology gives you the second but you want the first it's not necessarily difficult to still work as if you had the second. – Glen_b May 06 '14 at 01:02
  • Sorry, I didn't mean it to be impolite! I was just wondering if there is any problem with using data twice, or if that's ok. Thank you very much for your answer! – user45031 May 06 '14 at 01:04
  • Well, the problem is really only that you have two copies of the same unit in your sample, with the resulting effects of that. Usually you'd avoid that if you could do so easily. – Glen_b May 06 '14 at 01:05
  • Oh, I edited my earlier comment (above your last one) to respond to the part about human bias. – Glen_b May 06 '14 at 01:10
  • The gain from sampling WOR depends on the fraction of the population that is sampled. This may or may not be "very little". Moreover, in multi-stage samples, if PSUs are sampled WR, then a different set of second stage units should be selected each time a PSU is drawn. – Steve Samuels May 08 '14 at 01:32
  • @Steve Quite correct; the "a little" only works under the assumption $N \gg n$. – Glen_b May 08 '14 at 01:41
  • @Glen_b If we sample without replacement, will the sample mean $\frac{1}{n}(X_1+X_2+\cdots+X_n)$ be an unbiased estimator of the population mean? First, as you mentioned, $X_i$ and $X_j$ are not independent. Second, I think $X_i$ and $X_j$ do not even have identical distributions, right? For $j>i$, the distribution of $X_j$ changes a little compared to the distribution of $X_i$. Hence, if we let $\mu$ be the true mean of the population, then $E[X_1]=\mu$, but $E[X_j]$ will not be $\mu$ for all $j>2$, right? – KevinKim Aug 14 '16 at 17:25
  • This is already dealt with in the very first sentence of the answer. – Glen_b Apr 18 '19 at 02:55