
Say there are $m+n$ elements split into two groups ($m$ and $n$). The variance of the first group is $\sigma_m^2$ and the variance of the second group is $\sigma^2_n$. The elements themselves are assumed to be unknown but I know the means $\mu_m$ and $\mu_n$.

Is there a way to calculate the combined variance $\sigma^2_{(m+n)}$?

The variance doesn't have to be unbiased, so the denominator is $(m+n)$ and not $(m+n-1)$.

kjetil b halvorsen
user1809989
  • When you say you know the means and variances of these groups, are they parameters or sample values? If they are sample means/variances you should not use $\mu$ and $\sigma$... – Jonathan Christensen Nov 08 '12 at 19:18
  • I just used the symbols as a representation. Otherwise, it would have been hard to explain my problem. – user1809989 Nov 09 '12 at 00:11
  • For sample values, we usually use Latin letters (e.g. $m$ and $s$). Greek letters are usually reserved for parameters. Using the "correct" (expected) symbols will help you communicate more clearly. – Jonathan Christensen Nov 09 '12 at 02:31
  • No worries, I'll follow that from now on! Cheers – user1809989 Nov 09 '12 at 08:42
  • @Jonathan Because this is not a question about samples or estimation, one can legitimately take the view that $\mu$ and $\sigma^2$ are the *true* mean and variance of the empirical distribution of a batch of data, thereby justifying the conventional use of greek letters rather than latin letters to refer to them. – whuber Nov 09 '12 at 18:36
  • @whuber I would agree with you except for the note about the variance being biased at the end of the question, which doesn't make any sense unless we *are* in fact talking about samples. – Jonathan Christensen Nov 09 '12 at 18:40
  • @Jonathan I interpreted that remark as an attempt to emphasize that the question is *not* about estimation, so a bias correction would be unwarranted. – whuber Nov 09 '12 at 18:41
  • @whuber fair enough. – Jonathan Christensen Nov 09 '12 at 18:43

3 Answers


The idea is to express quantities as sums rather than fractions.

Given any $n$ data values $x_i,$ use the definitions of the mean

$$\mu_{1:n} = \frac{1}{\Omega_{1;n}}\sum_{i=1}^n \omega_{i} x_i$$

and sample variance

$$\sigma_{1:n}^2 = \frac{1}{\Omega_{1;n}}\sum_{i=1}^n \omega_{i}\left(x_i - \mu_{1:n}\right)^2 = \frac{1}{\Omega_{1;n}}\sum_{i=1}^n \omega_{i}x_i^2 - \mu_{1:n}^2$$

to express the (weighted) sum and sum of squares of the data as

$$\Omega_{1;n}\mu_{1:n} = \sum_{i=1}^n \omega_{i} x_i$$

and

$$\Omega_{1;n} \sigma_{1:n}^2 = \sum_{i=1}^n \omega_{i}\left(x_i - \mu_{1:n}\right)^2 = \sum_{i=1}^n \omega_{i}x_i^2 - \Omega_{1;n}\mu_{1:n}^2.$$

For notational convenience I have written $$\Omega_{j;k}=\sum_{i=j}^k \omega_i$$ for sums of weights. (In applications with equal weights, which are the usual ones, we may take $\omega_i=1$ for all $i,$ whence $\Omega_{1;n}=n.$)

Let's do the (simple) algebra. Order the indexes $i$ so that $i=1,\ldots,n$ designates elements of the first group and $i=n+1,\ldots,n+m$ designates elements of the second group. Break the overall combination of squares by group and re-express the two pieces in terms of the variances and means of the subsets of the data:

$$\eqalign{ \Omega_{1;n+m}(\sigma^2_{1:m+n} + \mu_{1:m+n}^2)&= \sum_{i=1}^{n+m} \omega_{i}x_i^2 \\ &= \sum_{i=1}^n \omega_{i} x_i^2 + \sum_{i=n+1}^{n+m} \omega_{i} x_i^2 \\ &= \Omega_{1;n}(\sigma^2_{1:n} + \mu_{1:n}^2) + \Omega_{n+1;n+m}(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2). }$$

Algebraically solving this for $\sigma^2_{1:m+n}$ in terms of the other (known) quantities yields

$$\sigma^2_{1:m+n} = \frac{\Omega_{1;n}(\sigma^2_{1:n} + \mu_{1:n}^2) + \Omega_{n+1;n+m}(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2)}{\Omega_{1;n+m}} - \mu^2_{1:m+n}.$$

Of course, using the same approach, $\mu_{1:m+n} = (\Omega_{1;n}\mu_{1:n} + \Omega_{n+1;n+m}\mu_{1+n:m+n})/\Omega_{1;n+m}$ can be expressed in terms of the group means, too.
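For a concrete check, here is a short sketch in Python (rather than the R used elsewhere on this page; the function name `combine_moments` is ours, not from the post) that pools two groups via the formulas above, using equal weights $\omega_i=1$, and verifies the result against a direct computation on the concatenated data:

```python
import random

def combine_moments(w1, mu1, var1, w2, mu2, var2):
    """Pool two groups given their total weights, means, and (biased) variances,
    using the sum-of-squares identity from the derivation above."""
    W = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / W
    var = (w1 * (var1 + mu1**2) + w2 * (var2 + mu2**2)) / W - mu**2
    return mu, var

def mean(v):
    return sum(v) / len(v)

def biased_var(v):
    # denominator len(v), matching the question's convention
    m = mean(v)
    return sum((t - m)**2 for t in v) / len(v)

random.seed(0)
x = [random.gauss(0, 1) for _ in range(10)]   # first group  (n = 10)
y = [random.gauss(2, 3) for _ in range(15)]   # second group (m = 15)

mu, var = combine_moments(len(x), mean(x), biased_var(x),
                          len(y), mean(y), biased_var(y))

# The pooled moments should match those of the concatenated data
assert abs(mu - mean(x + y)) < 1e-9
assert abs(var - biased_var(x + y)) < 1e-9
```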


Edit 1

An anonymous contributor points out that when the sample means are equal (so that $\mu_{1:n}=\mu_{1+n:m+n}=\mu_{1:m+n}$), the solution for $\sigma^2_{1:m+n}$ is a weighted mean of the group sample variances.
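This special case is easy to verify numerically; in the Python sketch below (illustrative values, not from the post), the between-group term vanishes because both groups share the same mean:

```python
# When the group means are equal, the pooled (biased) variance reduces to
# the weighted mean of the group variances: no between-group term remains.
w1, w2 = 10, 30          # group weights (e.g. sample sizes)
mu = 5.0                 # common mean of both groups
v1, v2 = 2.0, 8.0        # group variances

pooled = (w1 * (v1 + mu**2) + w2 * (v2 + mu**2)) / (w1 + w2) - mu**2
weighted_mean_var = (w1 * v1 + w2 * v2) / (w1 + w2)
assert abs(pooled - weighted_mean_var) < 1e-12
```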

Edit 2

I have generalized the formulas to weighted statistics. The motivation for this is a recent federal court case in the US involving a dispute over how to pool weighted variances: a government agency contends the proper method is to weight the two group variances equally. In working on this case I found it difficult to find authoritative references on combining weighted statistics: most textbooks do not deal with this or they assume the generalization is obvious (which it is, but not necessarily to government employees or lawyers!).

BTW, I used entirely different notation in my work on that case. If in the editing process any error has crept into the formulas in this post I apologize in advance and will fix them--but that would not reflect any error in my testimony, which was very carefully checked.

whuber
  • The "homework" tag doesn't mean the question is elementary or stupid: it's used for self-study questions that can even include research-level queries. It distinguishes routine, more or less context-free questions (of the sort that might ordinarily grace the math forum) from specific applied questions. – whuber Nov 09 '12 at 18:33
  • I cannot understand your first passage: $n(\sigma^2+\mu^2) = \sum (x - \mu)^2 + n\mu^2 \stackrel{?}{=} \sum x^2$ In particular I get $\sum [(x-\mu)^2+\mu^2] = \sum [x^2-2x\mu]$ which requires $\mu = 0$ Am I missing something? Could you please explain this? – DarioP Oct 21 '14 at 07:46
  • @Dario $$\sum(x-\mu)^2+n\mu^2=(\sum x^2 - 2\mu\sum x + n \mu^2)+n\mu^2 = \sum x^2 - 2n\mu^2 + 2n\mu^2 = \sum x^2.$$ – whuber Oct 22 '14 at 15:33
  • Oh yes, I did a stupid sign mistake in my derivation, now is clear, thanks!! – DarioP Oct 22 '14 at 15:42
  • I guess this can be extended to an arbitrary number of samples as long as you have the mean and variance for each. Calculating pooled (biased) standard deviation in R is simply ``sqrt(weighted.mean(u^2 + rho^2, n) - weighted.mean(u, n)^2)`` where ``n``, ``u`` and ``rho`` are equal-length vectors. E.g. ``n=c(10, 14, 9)`` for three samples. – Jonas Lindeløv Dec 07 '14 at 13:45
  • @whuber Sorry to bring you back to an old question, but could you shed some light on how you went from $\sigma_{1:n}^2 = \frac{1}{n}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2 = \frac{n-1}{n}\left(\frac{1}{n-1}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2\right)$ to $(m+n)(\sigma^2_{1:m+n} + \mu_{1:m+n}^2) = \sum_{i=1}^{n+m} x_i^2$. I don't really understand where the $(m+n)(\sigma^2_{1:m+n} + \mu_{1:m+n}^2)$ came from. Sorry if I'm missing something basic. – Chris C Oct 04 '15 at 17:47
  • @Chris C It expresses the squares in terms of the variance and the mean by rewriting the usual formula giving a variance in terms of the squares and the mean. Recall that this says the variance of a dataset of $N$ values is $1/N$ times the sum of squares, minus the squared mean. Multiply all of these by $N$ to deduce (easily) that the sum of squares equals $N$ times the variance plus $N$ times the squared mean. That's the entire point of the proceedings: everything is straightforward when expressed as sums instead of averages. – whuber Oct 04 '15 at 17:57
  • @whuber Thanks so much for the help, took me a while but I think I get it now. All the best. – Chris C Oct 05 '15 at 05:18
  • @whuber I'd love a reference to the court case you mention that hinges on the correct formula for combining variances. – Mike Izbicki Feb 26 '21 at 22:01
  • @Mike Here is a public pdf document of one ruling in this (ongoing) case; see pp 13 - 14: https://www.govinfo.gov/content/pkg/USCOURTS-ca13-18-01229/pdf/USCOURTS-ca13-18-01229-0.pdf . – whuber Feb 26 '21 at 22:29

I'm going to use standard notation for sample means and sample variances in this answer, rather than the notation used in the question. Using standard notation, another formula for the pooled sample variance of two groups can be found in O'Neill (2014) (Result 1):

$$s_\text{pooled}^2 = \frac{1}{n_1+n_2-1} \Bigg[ (n_1-1) s_1^2 + (n_2-1) s_2^2 + \frac{n_1 n_2}{n_1+n_2} (\bar{x}_1 - \bar{x}_2)^2 \Bigg].$$

This formula works directly with the underlying sample means and sample variances of the two subgroups, and does not require intermediate calculation of the pooled sample mean. (Proof of result in linked paper.)
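As a quick numerical confirmation, the following Python sketch (our own function name, not code from the paper) checks that the formula reproduces the unbiased sample variance of the concatenated data:

```python
import random

def pooled_var(n1, m1, v1, n2, m2, v2):
    """Pooled unbiased sample variance of two groups from their sizes (n),
    means (m), and unbiased variances (v), per O'Neill (2014), Result 1."""
    return ((n1 - 1) * v1 + (n2 - 1) * v2
            + n1 * n2 / (n1 + n2) * (m1 - m2) ** 2) / (n1 + n2 - 1)

def mean(v):
    return sum(v) / len(v)

def svar(v):
    # unbiased sample variance, denominator len(v) - 1
    m = mean(v)
    return sum((t - m) ** 2 for t in v) / (len(v) - 1)

random.seed(1)
x = [random.gauss(0, 1) for _ in range(8)]
y = [random.gauss(1, 2) for _ in range(12)]

s_pooled = pooled_var(len(x), mean(x), svar(x), len(y), mean(y), svar(y))
assert abs(s_pooled - svar(x + y)) < 1e-9
```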

Ben

Use the sample.decomp function in the utilities package

Statistical problems of this kind have now been automated in the sample.decomp function in the utilities package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and the pooled moments. It works for decompositions up to fourth order, i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.


How to use the function: Here we give an example where we use the function to compute the sample moments of a pooled sample composed of three subgroups. To do this, we first generate a mock dataset DATA containing three subgroups with unequal sizes, and we pool these as the single dataset POOL. The moments of the subgroups and the pooled sample can be obtained using the moments function in the same package.

#Create some subgroups of mock data and a pooled dataset
set.seed(1)
N    <- c(28, 44, 51)
SUB1 <- rnorm(N[1])
SUB2 <- rnorm(N[2])
SUB3 <- rnorm(N[3])
DATA <- list(SUB1 = SUB1, SUB2 = SUB2, SUB3 = SUB3)
POOL <- c(SUB1, SUB2, SUB3)

#Show sample statistics for the subgroups
library(utilities)
moments(DATA)

      n sample.mean sample.var sample.skew sample.kurt NAs
SUB1 28  0.09049834  0.9013829  -0.7648008    3.174128   0
SUB2 44  0.18637936  0.8246700   0.3653918    3.112901   0
SUB3 51  0.05986594  0.6856030   0.3076281    2.306243   0

#Show sample statistics for the pooled sample
moments(POOL)

       n sample.mean sample.var sample.skew sample.kurt NAs
POOL 123    0.112096  0.7743711  0.04697463     2.95196   0

Now that we have the moments of the subgroups, we can use the sample.decomp function to obtain the pooled sample moments from the subgroup sample moments. As input to this function you can either use the moments output for the subgroups, or you can input the sample sizes and sample moments separately as vectors (here we do the latter). As you can see, this gives the same sample moments for the pooled sample as direct computation from the underlying data.

#Compute sample statistics for subgroups
library(utilities)
MEAN   <- c(    mean(SUB1),     mean(SUB2),     mean(SUB3))
VAR    <- c(     var(SUB1),      var(SUB2),      var(SUB3))
SKEW   <- c(skewness(SUB1), skewness(SUB2), skewness(SUB3))
KURT   <- c(kurtosis(SUB1), kurtosis(SUB2), kurtosis(SUB3))

#Compute sample decomposition
sample.decomp(n = N, 
              sample.mean = MEAN,
              sample.var  = VAR,
              sample.skew = SKEW,
              sample.kurt = KURT,
              names       = names(DATA))

             n sample.mean sample.var sample.skew sample.kurt
SUB1        28  0.09049834  0.9013829 -0.76480085    3.174128
SUB2        44  0.18637936  0.8246700  0.36539179    3.112901
SUB3        51  0.05986594  0.6856030  0.30762810    2.306243
--pooled-- 123  0.11209600  0.7743711  0.04697463    2.951960

As you can see, the sample.decomp function allows computation of the pooled sample variance and also other pooled sample central moments, up to fourth order. You can read about this function in the package documentation. However, it is worth noting that for the higher-order moments (the skewness and kurtosis) there are several different statistics that can be used. The above functions accommodate these variations through the arguments skew.type, kurt.type and kurt.excess, which have default values that we have used in the above computations.

Ben