1

I came across the following formula for pooled variance:

$$\frac{(n-1)s_x^2+(m-1)s_y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)$$

Here we have two samples of sizes $n$ and $m$, and $s_x^2, s_y^2$ are the sample variances.

I understand that the first factor is a weighted average, but where does the second factor come from?


Yola

1 Answer

5

$Var[\bar{x}-\bar{y}] = Var[\bar{x}] + Var[\bar{y}] = \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}$

Now we assume that $x$ and $y$ come from populations with the same variance, so that $\sigma_x^2=\sigma_y^2=\sigma^2$.

The expression for variance becomes:

$Var[\bar{x}-\bar{y}] = \sigma^2 \left(\frac{1}{n} + \frac{1}{m}\right)$
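As a quick sanity check on this expression, here is a minimal simulation sketch. It assumes independent Gaussian samples with a common $\sigma$; the function name, sample sizes, and seed are illustrative choices of mine, not part of the derivation.

```python
import random
import statistics

def simulate_var_of_mean_difference(n, m, sigma, reps=20000, seed=0):
    """Empirical variance of (x_bar - y_bar) over many repeated samples.

    Assumes both samples are independent draws from N(0, sigma^2).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        x_bar = statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        y_bar = statistics.fmean(rng.gauss(0.0, sigma) for _ in range(m))
        diffs.append(x_bar - y_bar)
    return statistics.variance(diffs)

n, m, sigma = 8, 12, 2.0
theoretical = sigma**2 * (1 / n + 1 / m)   # sigma^2 * (1/n + 1/m)
empirical = simulate_var_of_mean_difference(n, m, sigma)
print(f"theoretical: {theoretical:.4f}, simulated: {empirical:.4f}")
```

With enough repetitions the simulated value should land close to the theoretical one, up to Monte Carlo noise.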

Since we are assuming that $X$ and $Y$ have the same variance, we can estimate $\sigma^2$ using the pooled variance. (Edit: As whuber mentions in his comment, using the pooled variance assumes that $\bar{X}$ and $\bar{Y}$ are independent and also that $\mu_x=\mu_y$.)

$\hat{\sigma}^2 = \frac{(n-1)s_x^2+(m-1)s_y^2}{n+m-2}$

And therefore, substituting this estimate for $\sigma^2$:

$$\widehat{Var}[\bar{x}-\bar{y}] = \frac{(n-1)s_x^2+(m-1)s_y^2}{n+m-2}\left(\frac{1}{n} + \frac{1}{m}\right)$$
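To make the two factors concrete, here is a small sketch that computes the formula from raw data. The function names and the toy data are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
import statistics

def pooled_variance(x, y):
    """Weighted average of the two sample variances (weights n-1 and m-1)."""
    n, m = len(x), len(y)
    s2_x = statistics.variance(x)  # sample variance, divides by n - 1
    s2_y = statistics.variance(y)  # sample variance, divides by m - 1
    return ((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2)

def estimated_var_of_mean_difference(x, y):
    """Pooled variance times (1/n + 1/m): estimates Var[x_bar - y_bar]."""
    return pooled_variance(x, y) * (1 / len(x) + 1 / len(y))

# Toy data: s_x^2 = 2.5 with n = 5, s_y^2 = 4 with m = 3
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0]

# Pooled variance: (4 * 2.5 + 2 * 4) / 6 = 3
print(pooled_variance(x, y))
# Estimated variance of the difference in means: 3 * (1/5 + 1/3), about 1.6
print(estimated_var_of_mean_difference(x, y))
```

The first function is the weighted-average factor from the question, and the second multiplies it by $\left(\frac{1}{n} + \frac{1}{m}\right)$, which is the part that comes from the variance of a difference of two sample means.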

Hugh
  • Since you are being careful about mentioning the assumptions needed to carry out this analysis, consider including two additional crucial ones: first, that $\bar x$ and $\bar y$ are independent; and then, to justify the pooling, that you have implicitly assumed the two populations also have the same mean. – whuber Dec 29 '17 at 13:53
  • @whuber Thanks, I never realised that pooled variance requires assuming that the population means are equal. I think that only for Gaussian $X$ is the sample mean independent of the sample variance. So in the case that $X$ and $Y$ are Gaussian distributed we can forgo the assumption that the population means are equal. Is that correct? – Hugh Dec 29 '17 at 16:10
  • The problem is more basic than that: if the population means are not equal, then the pooled variance does not estimate anything relevant to the distribution of the $t$ statistic. In the testing situation where variances are pooled, the null hypothesis is that means are equal, thereby justifying the pooling. For any other hypothesis, it's (at best) unclear what the pooled formula would represent or how it could be useful. – whuber Dec 29 '17 at 16:14
  • @Whuber That is a good point. What if we just want to know the variance of $\bar{x}+\bar{y}$ to create a prediction interval but not to test a hypothesis? – Hugh Dec 29 '17 at 16:37
  • What would you be predicting? The only thing that seems to make sense is to suppose the two samples are independent, independent of each other, and *from the same population;* in such a case, you would be predicting a new draw from the same population. That implies equal means. Otherwise, more than one population is in play and the meaning of a prediction would be unclear. – whuber Dec 29 '17 at 16:58
  • @Whuber Perhaps I'm playing a game where I pick n cards from one biased pack and m cards from another biased pack. In this game I win if the sum of the two averages of my cards is bigger than 10. I'm predicting the future sums of averages. I appreciate this is an obscure application but creating a prediction interval is not out of the question. – Hugh Dec 29 '17 at 17:35
  • You wouldn't use a pooled SD for this purpose: you want to find a prediction interval for the sum of the averages and a good one will involve separate estimates of the variances of the two sums. – whuber Dec 29 '17 at 18:11
  • Minor point but you should also say that the final result is an estimate of the variance of the difference between the two sample means. – Michael R. Chernick Dec 29 '17 at 21:33
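
For the card-game case discussed in the comments, whuber's suggestion of separate variance estimates could be sketched roughly as below. This only estimates $Var[\bar{x}+\bar{y}]$ as $s_x^2/n + s_y^2/m$, assuming the two samples are independent; it does not build a full prediction interval, and the function name and data are hypothetical.

```python
import statistics

def unpooled_var_of_mean_sum(x, y):
    """Estimate Var[x_bar + y_bar] with separate variances: s_x^2/n + s_y^2/m.

    Assumes the two samples are independent; no equal-variance assumption.
    """
    return statistics.variance(x) / len(x) + statistics.variance(y) / len(y)

# Illustrative draws from two different "packs"
x = [1.0, 2.0, 3.0, 4.0, 5.0]   # s_x^2 = 2.5, n = 5
y = [2.0, 4.0, 6.0]             # s_y^2 = 4,   m = 3

# 2.5 / 5 + 4 / 3, roughly 1.83
print(unpooled_var_of_mean_sum(x, y))
```

Because the variance of a sum of independent quantities equals the variance of their difference, the same expression also serves as the unpooled estimate for $\bar{x}-\bar{y}$.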