
I was reading these notes, "Finite Population Sampling with Application to the Hypergeometric Distribution," and I have a question about just the first two pages. On the first page, they derive the expected value and variance of the sample mean (the sample proportion) for the hypergeometric case, and on the next page they compare it to the case where y is binomial.

I am confused because I do not see what mathematically causes the difference in the standard error. To me it seems that the sample proportion, $\bar{y}$, is exactly the same in the binomial and hypergeometric cases, i.e., the proportion of 1's in the sample. But these must actually be different, or you wouldn't get two different formulas for the variance. What am I missing here?

Steve

1 Answer


A binomial random variable is based on independent trials, often modeling sampling with replacement. A hypergeometric random variable is based on trials that are not independent, often modeling sampling without replacement.

A major difference between the two models is that for 'comparable' situations, the hypergeometric random variable has a smaller variance. Intuitively, you can view this as a consequence of the variety of choices decreasing for later trials because the 'population' becomes depleted due to sampling without replacement.

Consider an urn with 8 green chips (successes) and 8 red ones (failures). You sample $n = 8$ chips from the urn.

  • Random variable $X$ counts the successes under sampling with replacement. Then the probability of success on any one trial (draw) is $p = 8/16 = 1/2$ and $X \sim \mathsf{Binom}(n=8,p=1/2),$ with $E(X) = np = 8(1/2) = 4,$ $Var(X) = np(1-p) = 8/4 = 2.$

  • Random variable $Y$ counts the successes under sampling without replacement. Then the draws are not independent. Defining $p = g/T = 8/16 = 1/2,$ where $g$ is the number of green chips and $T$ is the total number of chips in the urn, one can show that $E(Y) = np = 4,$ $$Var(Y) = np(1-p)\left(\frac{T-n}{T-1}\right) = 2\cdot\frac{16-8}{16-1} = 2(8/15) = 16/15 \approx 1.0667.$$ Bear in mind that as chips of one color are selected, it becomes more likely to draw a chip of the opposite color on the next draw, so probabilities are affected by the depletion of the urn when sampling is without replacement.
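
One way to see where the factor $(T-n)/(T-1)$ comes from is to write each count as a sum of draw indicators. What follows is a sketch of that standard argument. The notation $I_i$ (equal to 1 if draw $i$ is green and 0 otherwise) is introduced here for illustration and is not taken from the linked notes.

$$Y = \sum_{i=1}^{n} I_i, \qquad E(I_i) = p, \qquad Var(I_i) = p(1-p).$$

Each individual draw has probability $p$ of being green whether or not chips are replaced, so the means agree. With replacement the indicators are independent and all covariances are zero. Without replacement, for $i \ne j,$

$$Cov(I_i, I_j) = \frac{g}{T}\cdot\frac{g-1}{T-1} - p^2 = -\frac{p(1-p)}{T-1},$$

and summing the $n$ variances and the $n(n-1)$ covariances gives

$$Var(Y) = np(1-p) - n(n-1)\frac{p(1-p)}{T-1} = np(1-p)\left(\frac{T-n}{T-1}\right).$$

Dividing by $n^2$ gives the variance of the sample proportion, $Var(\bar{y}) = \frac{p(1-p)}{n}\left(\frac{T-n}{T-1}\right).$ So $\bar{y}$ is the same function of the sample in both cases. What differs is its distribution, because the draws are negatively correlated when sampling is without replacement.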

Simulating a million realizations of each random variable in R, sample means and variances should agree with population means and variances, respectively, to a couple of decimal places:

set.seed(330)
x = rbinom(10^6, 8, .5)
mean(x);  var(x)
[1] 4.000247       # aprx E(X) = 4
[1] 1.997837       # aprx Var(X) = 2
y = rhyper(10^6, 8,8, 8)
mean(y);  var(y)
[1] 4.001567       # aprx E(Y) = 4
[1] 1.066994       # aprx Var(Y) = 1.0667
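
The same comparison can be checked without simulation by summing over the two PMFs. This is a minimal sketch for the urn example above. The variable names (`T.pop`, `fpc`, `cov12`, and so on) are illustrative choices, not from the original notes.

```r
# Urn from the example: 8 green chips out of 16, sample of size 8.
# All names here are hypothetical and chosen for illustration.
T.pop <- 16; g <- 8; n <- 8
p <- g / T.pop
k <- 0:n

# Exact PMFs: with replacement (binomial), without replacement (hypergeometric)
pmf.b <- dbinom(k, n, p)
pmf.h <- dhyper(k, g, T.pop - g, n)

# Exact variances of the counts, computed as E(K^2) - E(K)^2
var.b <- sum(k^2 * pmf.b) - sum(k * pmf.b)^2
var.h <- sum(k^2 * pmf.h) - sum(k * pmf.h)^2

# Finite population correction and covariance between two distinct draws
fpc   <- (T.pop - n) / (T.pop - 1)
cov12 <- (g / T.pop) * ((g - 1) / (T.pop - 1)) - p^2

# Hypergeometric variance = binomial variance times the correction
print(all.equal(var.h, var.b * fpc))
# ... and also = n*Var(I) + n*(n-1)*Cov(I_1, I_2)
print(all.equal(var.h, n * p * (1 - p) + n * (n - 1) * cov12))

# Standard errors of the sample proportion under each model
se.b <- sqrt(var.b) / n
se.h <- sqrt(var.h) / n
print(c(binomial = se.b, hypergeometric = se.h))
```

Both `all.equal` checks should report agreement, and the hypergeometric standard error comes out smaller by the factor `sqrt(fpc)`.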

Here are bar charts for the relevant binomial (blue) and hypergeometric (brown) distributions in the example above. Notice that the hypergeometric probabilities are more concentrated near the center.


g = 0:8
pdf.b = dbinom(g, 8, .5)
pdf.h = dhyper(g, 8,8, 8)
hdr = "PDFs of Binomial (blue) and Hypergeometric Dist'ns"
plot(g-.05, pdf.b, type="h", ylim=c(0,.4), lwd=2, col="blue",
     ylab="PDF", xlab="green", main = hdr)
lines(g+.05, pdf.h, type="h", lwd=2, col="brown")
abline(h = 0, col="green2")
BruceET