
Extra-binomial variation is defined in this Oxford Reference source:

Greater variability in repeat estimates of a population proportion than would be expected if the population had a binomial distribution. For example, suppose that $n$ observations are taken on independent Bernoulli variables that take the value $1$ with probability $p$, and the value $0$ with probability $1-p$. The mean of the total of the observations will be $np$ and the variance will be $np(1-p)$. However, if the probability varies from variable to variable, with overall mean $p$ as before, then the variance of the total will now be $>np(1-p)$.

I do not follow this statement. Say we are comparing two variables:

$X \sim Bin(5, 0.5)$ (so $E(X) = np = 2.5$, and $var(X) = np(1-p) = 1.25$).

$Y = \sum_{i=1}^{5} Z_i$, where $Z_1, Z_2, Z_3, Z_4, Z_5$ are Bernoulli with probabilities $0.1, 0.3, 0.6, 0.7$ and $0.8$, respectively. The $Z_i$'s are independent of each other and of $X$.

So $E(X) = 2.5 = E(Y)$, and the condition in the reference is met ("the probability varies from variable to variable, with overall mean $p$ as before").

Then: $$var(Y) = \sum_{i=1}^5 var(Z_i) = \sum_{i=1}^5 p_i(1-p_i)$$ $$= 0.1(1-0.1) + 0.3(1-0.3) + 0.6(1-0.6) + 0.7(1-0.7) + 0.8(1-0.8) = 0.91$$

So $var(X) = 1.25$, $var(Y) = 0.91$, and $var(Y) < np(1-p) = var(X)$, counter to the last line of the quoted reference. Am I correct in pointing out that the reference is wrong, or have I made a mistake somewhere?
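A quick numeric check of the two variances (a sketch in R, with the $p_i$ as listed above):

```r
# Compare var(X) for Binomial(5, 0.5) with var(Y) for a sum of
# independent Bernoullis with varying probabilities p_i.
p_bar <- 0.5
p_i   <- c(0.1, 0.3, 0.6, 0.7, 0.8)

var_X <- 5 * p_bar * (1 - p_bar)   # np(1-p) = 1.25
var_Y <- sum(p_i * (1 - p_i))      # sum of Bernoulli variances = 0.91
c(var_X = var_X, var_Y = var_Y)
```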

kjetil b halvorsen
bob

1 Answer


This is an interpretation issue: there are multiple ways to interpret the statement, and they give different results.

  1. We know from the original question that taking one of each $p\in\{0.1,0.3,0.6,0.7,0.8\}$ gives $\mathrm{var}[Y]=0.91<5\bar p(1-\bar p)=1.25$

  2. We might also mean that $p$ is a random variable, and want to average over its distribution

> r<-replicate(100000,{
+     p<-sample(c(0.1,0.3,0.6,0.7,0.8),5, replace=TRUE)
+     sum(rbinom(5,1,p))
+ })
> var(r)
[1] 1.250052

So far, the claim isn't looking very good. In fact, de Finetti's theorem tells us that interpretation 2 has to give 1.25 as the answer: the distribution of exchangeable binary variables is iid Bernoulli conditional on the mean of $p$.
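The de Finetti conclusion can also be seen directly. For a single binary observation $Z$ whose success probability $p$ is random with mean $\bar p$, the law of total variance gives $$\mathrm{var}[Z]=E[\mathrm{var}[Z\mid p]]+\mathrm{var}[E[Z\mid p]]=E[p(1-p)]+\mathrm{var}[p]$$ $$=\bar p-E[p^2]+E[p^2]-\bar p^2=\bar p(1-\bar p),$$ exactly the Bernoulli($\bar p$) variance. Summing five such draws with independently sampled $p$'s therefore gives $5\bar p(1-\bar p)=1.25$, matching the simulation.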

But we're not done yet. Suppose we took more than one observation with each $p$, say ten:

  1. The one-of-each approach, by simulation
> r<-replicate(100000,{
+     p<-sample(c(0.1,0.3,0.6,0.7,0.8),5, replace=FALSE)
+     sum(rbinom(5,10,p))
+ })
> var(r)
[1] 9.049306
  2. The random-$p$ approach, by simulation
> r<-replicate(100000,{
+     p<-sample(c(0.1,0.3,0.6,0.7,0.8),5, replace=TRUE)
+     sum(rbinom(5,10,p))
+ })
> var(r)
[1] 43.29736

In this case $\bar p=0.5$ and the constant-$p$ formula gives $50\bar p(1-\bar p)=12.5$

So, the one-of-each variance (9.05) is smaller than $50\bar p(1-\bar p)=12.5$ and the random-$p$ variance (43.3) is larger.
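The one-of-each value can be checked exactly: $Y$ is then a sum of five independent $\mathrm{Binomial}(10,p_i)$ variables, so the variances simply add (a sketch in R):

```r
# Exact variance for the one-of-each design: independent
# Binomial(10, p_i) components, so sum the individual variances.
p_i <- c(0.1, 0.3, 0.6, 0.7, 0.8)
var_fixed <- sum(10 * p_i * (1 - p_i))
var_fixed   # 9.1, close to the simulated 9.05
```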

That's the general phenomenon the reference was talking about. Varying $p$ gives you overdispersion, but only if you take more than one observation from each $p$. There's no such thing as overdispersed exchangeable binary data.

To finish off, we can do something analytic. Suppose $p$ is random with mean $p_0$ and variance $\tau^2$, and the conditional distribution of $Y\mid p$ is $\mathrm{Binomial}(m,p)$.

The conditional variance decomposition says $$\mathrm{var}[Y] = E[\mathrm{var}[Y|p]]+\mathrm{var}[E[Y|p]]$$ which comes to $$E[mp(1-p)]+\mathrm{var}[mp]=E[mp(1-p)]+m^2\mathrm{var}[p]$$ Now $$E[mp(1-p)]=E[mp]-E[mp^2] = mp_0-mp_0^2-m\tau^2$$ so $$E[mp(1-p)]+\mathrm{var}[mp]= mp_0-mp_0^2-m\tau^2+m^2\tau^2$$

If (and only if) $m=m^2$, i.e. $m=1$, this simplifies to $\mathrm{var}[Y]=mp_0(1-p_0)$; for $m>1$ it is larger. On the other hand, the average variance of $Y$ conditional on $p$, $E[\mathrm{var}[Y\mid p]]=mp_0(1-p_0)-m\tau^2$, is always smaller than $mp_0(1-p_0)$, which fits with approach 1.
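Plugging the numbers from the simulations into this decomposition (a quick check; here $\tau^2 = 0.068$ is the population variance of the five $p$ values):

```r
# Per-cluster variance from the decomposition, then summed over the
# five independent clusters of the random-p simulation above.
m  <- 10; p0 <- 0.5
p_i  <- c(0.1, 0.3, 0.6, 0.7, 0.8)
tau2 <- mean((p_i - p0)^2)                             # 0.068
var_cluster <- m*p0 - m*p0^2 - m*tau2 + m^2*tau2       # 8.62
5 * var_cluster                                        # 43.1, close to the simulated 43.30
```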

Thomas Lumley