
Variance can be combined as

$$v=\frac{1}{n-1}\left(\sum_{i = 1}^{numGroups}n_{i}(m_{i}-m)^2+ \sum_{i = 1}^{numGroups}(n_{i}-1)v_{i}\right)$$

where $v$ is the combined variance, $n$ is the total sample size, $n_i$ is the number of points in group $i$, $numGroups$ is the total number of groups, $m_i$ is the mean of group $i$, $m$ is the combined mean, and $v_i$ is the sample variance of the $i^{th}$ group.

Is there a name for this formula or any reference to it?

Dilip Sarwate
Budhapest
  • This is, in essence, the _[Total Variance Formula](http://en.wikipedia.org/wiki/Law_of_total_variance)_ with the first term being the variance of the conditional mean and the second term being the mean of the conditional variance. – Dilip Sarwate Oct 22 '14 at 22:51
  • [law of total variance](http://en.wikipedia.org/wiki/Law_of_total_variance) ? – Karolis Koncevičius Oct 22 '14 at 22:52
  • @DilipSarwate Thank you. Could you please expand on the derivation? – Budhapest Oct 22 '14 at 23:19
  • You might also call it the partition of total variance into *within* and *between* components, such as in ANOVA. If you want to ask what sounds like a new question -- about the derivation -- search to find out if it has already been answered on site and if it hasn't, ask a new question (in this case I am certain it has been answered before). – Glen_b Oct 23 '14 at 00:13

2 Answers


Let $x_{i,j}$ denote the $j$-th data point in the $i$-th group which has $n_i$ data points. There are $N$ such groups and thus a total of $\sum_{i=1}^N n_i = n$ data points.

If the sample mean and sample variance of the $i$-th group are $m_i$ and $v_i$ respectively, then we have
$$n_i\cdot m_i = \sum_{j=1}^{n_i} x_{i,j}\quad \text{and} \quad (n_i-1)v_i = \sum_{j=1}^{n_i} \left(x_{i,j} - m_i\right)^2.$$
It follows that $\displaystyle \sum_{i=1}^N \sum_{j=1}^{n_i} x_{i,j} = \sum_{i=1}^N n_i\cdot m_i = n\cdot m$ where $m$ is the overall mean of the $n$ data points.

Similarly, the sum $\displaystyle \sum_{i=1}^N (n_i-1)v_i = \sum_{i=1}^N \sum_{j=1}^{n_i}\left(x_{i,j} - m_i\right)^2$ can be recognized as the sum of the squared deviations of the data points from the means of their respective groups. This is not quite what we want for calculating the variance of the $n$ data points: we need to know the sum of the squared deviations from $m$. Fortunately, all that is needed is a little algebra. We have that
$$\begin{align} \sum_{i=1}^N\sum_{j=1}^{n_i} \left(x_{i,j} - m\right)^2 &= \sum_{i=1}^N \left[\sum_{j=1}^{n_i}\left(x_{i,j}^2 -2x_{i,j}m + m^2\right)\right]\\ &= \sum_{i=1}^N \left[\left(\sum_{j=1}^{n_i}x_{i,j}^2\right) -2n_im_im + n_im^2\right]\\ &= \sum_{i=1}^N \left[\left(\sum_{j=1}^{n_i}x_{i,j}^2\right) + n_i(m^2 -2m_im + m_i^2) - n_im_i^2\right]\\ &=\sum_{i=1}^N \left[n_i(m_i-m)^2 + \sum_{j=1}^{n_i}\left(x_{i,j}^2-m_i^2\right) \right]\\ &= \sum_{i=1}^N \left[n_i(m_i-m)^2 + \sum_{j=1}^{n_i}\left(x_{i,j}^2-2x_{i,j}m_i + m_i^2\right) \right]\\ &= \sum_{i=1}^N \left[n_i(m_i-m)^2 + \sum_{j=1}^{n_i}\left(x_{i,j}-m_i\right)^2 \right]\\ &= \sum_{i=1}^N \left[n_i(m_i-m)^2 + (n_i-1)v_i \right]. \end{align}$$
All that remains is to divide both sides by $n-1$ and we are done.
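As a quick numerical check of the identity, here is a minimal Python sketch. The function name `combine_variance` and the sample data are made up for illustration, and each group is assumed to have at least two points so that its sample variance (with the $n_i-1$ denominator) is defined.

```python
from statistics import mean, variance


def combine_variance(groups):
    """Combine per-group summaries into an overall count, mean and sample variance.

    `groups` is an iterable of (n_i, m_i, v_i) tuples, where v_i is the
    sample variance of group i (denominator n_i - 1).
    The name and signature are illustrative, not taken from any library.
    """
    groups = list(groups)
    n = sum(n_i for n_i, _, _ in groups)
    m = sum(n_i * m_i for n_i, m_i, _ in groups) / n
    # Sum of n_i (m_i - m)^2: deviations of the group means from the overall mean.
    between = sum(n_i * (m_i - m) ** 2 for n_i, m_i, _ in groups)
    # Sum of (n_i - 1) v_i: squared deviations of points from their own group mean.
    within = sum((n_i - 1) * v_i for n_i, _, v_i in groups)
    return n, m, (between + within) / (n - 1)


# Illustrative data: three groups of unequal size.
data = [[1, 2, 3], [4, 6], [10, 12, 14, 16]]
summaries = [(len(g), mean(g), variance(g)) for g in data]
n, m, v = combine_variance(summaries)

# Compare against the variance computed directly from all points.
pooled = [x for g in data for x in g]
print(v, variance(pooled))
```

The two printed values should agree up to floating-point rounding, since the function only needs the triples $(n_i, m_i, v_i)$ and never touches the raw data.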

Dilip Sarwate

In the particular case when $N=2$, the formula can be rewritten: since
$$\require{cancel} m=\frac{n_1m_1+n_2m_2}{n_1+n_2} $$
we have that
$$ (m_1-m)^2 = \left(\frac{(\cancel{n_1m_1}+n_2m_1) - (\cancel{n_1m_1}+n_2m_2)}{n_1+n_2}\right)^2 = \left(\frac{n_2}{n_1+n_2}\right)^2 (m_1-m_2)^2. $$
Doing the same for $(m_2-m)^2$ and combining everything together, we obtain that the first term of the summation for $v$ is
$$ \begin{align} \sum_{i=1}^2 n_i(m_i - m)^2 &= \frac{n_1n_2^2}{(n_1+n_2)^2} (m_1-m_2)^2 + \frac{n_1^2n_2}{(n_1+n_2)^2} (m_1-m_2)^2 \\ &= \frac{n_1n_2}{(n_1+n_2)\cancel{^2}}(m_1-m_2)^2\cancel{(n_1+n_2)}. \end{align} $$
Therefore
$$ v = \frac{1}{n-1} \left( \frac{n_1n_2}{n}(m_1-m_2)^2 + (n_1-1)v_1 + (n_2-1)v_2 \right). $$
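The two-group form lends itself to merging summaries pairwise. The following is a small Python sketch under that reading; `merge_two` is a hypothetical helper name and the two data sets are invented for the check.

```python
from statistics import mean, variance


def merge_two(n1, m1, v1, n2, m2, v2):
    """Merge the (count, mean, sample variance) summaries of two groups.

    Implements the N = 2 case: the between-group term collapses to
    n1 * n2 / n * (m1 - m2)^2. Illustrative helper, not a library function.
    """
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    v = (n1 * n2 / n * (m1 - m2) ** 2 + (n1 - 1) * v1 + (n2 - 1) * v2) / (n - 1)
    return n, m, v


# Illustrative data: two small groups.
a, b = [1, 2, 3], [4, 6]
n_ab, m_ab, v_ab = merge_two(len(a), mean(a), variance(a),
                             len(b), mean(b), variance(b))

# Direct computation on the concatenated data for comparison.
print(v_ab, variance(a + b))
```

Because the result is again a (count, mean, variance) triple, more than two groups can be handled by applying the merge repeatedly.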

Rackbox