
Let's say I have three groups of values, each with the same number of values. However, the number of values per group is unknown (the values are no longer available). For each group I do have the mean and the variance. How can I calculate the mean and the variance of the total population? The mean should be easy: it is simply the mean of the means. But what about the variance?

Edit: The question "How to calculate the variance of a partition of variables" seems to deal with a similar issue.

Make42
  • The answer depends on what formula you are referring to by "variance." Could you clarify? Would it perhaps be in the same sense as the question at https://stats.stackexchange.com/questions/10441 (as found by Alex Nikiforov)? – whuber Aug 29 '17 at 17:26
  • There's an old rule of thumb in statistics that the variance of the sums is equal to the sum of the variances. It may apply in your case. – Mike Hunter Aug 29 '17 at 17:58
  • When you're combining non-overlapping subgroups the variation between the means comes in as well (many, many posts on site deal with that issue). The wrinkle here is doing it when $n$ is unknown. – Glen_b Aug 30 '17 at 02:45

3 Answers


First of all, the mean is not, in general, the mean of the means. Writing $N=n_1+n_2+n_3$ for the size of the population (which here is the union of the three groups), the population average is $\mu=\frac{n_1\mu_1+n_2\mu_2+n_3\mu_3}{n_1+n_2+n_3}$. Thus, you have a set of possible averages in a simplex generated by the constraints $N=n_1+n_2+n_3$ and $n_1,n_2,n_3>0$.

For the variance ($\sigma^2$), you can use a similar approach. The population variance is the sum of the Between Group Variance and the Within Group Variance, as follows: $$N\cdot \sigma^2=\sum\limits_{g=1}^3 n_g(\mu_g-\mu)^2+\sum\limits_{g=1}^3 n_g \sigma^2_g$$ Also in this case, considering that $$\sum\limits_{g=1}^3 n_g(\mu_g-\mu)^2=\sum\limits_{g=1}^3 n_g\mu_g^2-N\cdot\mu^2$$ your solution is one of the possible solutions inside the simplex. But remember that $\mu$ and $\sigma$ both depend on the choice of $n_1$, $n_2$, and $n_3$.

In your case, $n_1=n_2=n_3$, so the weights cancel: $\mu$ is the plain mean of the three group means, and the total variance is $$\sigma^2=\frac{1}{3}\sum\limits_{g=1}^3 \left[(\mu_g-\mu)^2+\sigma^2_g\right]$$
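As a rough check of the decomposition above, here is a minimal Python sketch using only the standard library. The function name `combine` and the sample data are made up for illustration, and the variances are population variances (divisor $n$, not $n-1$), as in the formulas above.

```python
from statistics import fmean, pvariance


def combine(means, variances, sizes):
    """Pooled population mean and variance from per-group summaries."""
    total = sum(sizes)
    mean = sum(n * m for n, m in zip(sizes, means)) / total
    # Between-group part: spread of the group means around the overall mean.
    between = sum(n * (m - mean) ** 2 for n, m in zip(sizes, means))
    # Within-group part: size-weighted group variances.
    within = sum(n * v for n, v in zip(sizes, variances))
    return mean, (between + within) / total


# Hypothetical data: three groups of equal size.
groups = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [10.0, 10.0, 11.0, 13.0]]
means = [fmean(g) for g in groups]
variances = [pvariance(g) for g in groups]

# With equal sizes the weights cancel, so any common size gives the same result.
mean_a, var_a = combine(means, variances, [4, 4, 4])
mean_b, var_b = combine(means, variances, [1, 1, 1])

# Compare against the statistics of the pooled raw values.
pooled = [x for g in groups for x in g]
print(mean_a, fmean(pooled))
print(var_a, pvariance(pooled))
```

Passing `[1, 1, 1]` as the sizes gives the same result as the true sizes, which is why the unknown common group size does not matter for this question.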


Well, for two groups (for simplicity), the variance estimate can be obtained as follows:

$\hat{\sigma}^2 = \frac{1}{2N}\sum_{i=1}^{2N}(X_i-\mu)^2 = \frac{1}{2N}\sum_{i=1}^{N}(X_i-\mu)^2 + \frac{1}{2N}\sum_{i=N+1}^{2N}(X_i-\mu)^2 = \frac{1}{2}\left(\hat{\sigma}^2_1 + \hat{\sigma}^2_2\right)$

where the $X_i$ are the values (which in your case are no longer available), $N$ is the number of values per group, and $\mu$ is the mean. Note that the last equality treats $\mu$ as the mean of each group as well as of the whole population.

So the variance of the total population is the average of the group variances, $\frac{1}{2}(\hat{\sigma}^2_1 + \hat{\sigma}^2_2)$, where $\hat{\sigma}^2_1$ is the variance of group 1 and $\hat{\sigma}^2_2$ is the variance of group 2.

Quick test in Octave, where $x$ and $y$ are the two groups:

octave:1> x = 3*randn(1000, 1);
octave:2> y = 3*randn(1000, 1);
octave:3> var(x)
ans =  9.0051
octave:4> var(y)
ans =  8.8170
octave:5> 0.5*(var(x) + var(y))
ans =  8.9111
octave:6>

$\hat{\sigma}^2_1 = 9.0051$, $\hat{\sigma}^2_2 = 8.8170$, $\hat{\sigma}^2 = 8.9111$

Think of your estimate as a random variable: it has its own mean and variance.

[EDIT] There is a better (and more correct) answer.
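To see when averaging the group variances is enough, here is a small Python sketch with made-up data in which the two groups have clearly different means. It uses population variances (`statistics.pvariance`), not the $n-1$ version. In this example the average of the variances falls short of the pooled variance by exactly the variance of the group means.

```python
from statistics import fmean, pvariance

# Hypothetical groups with equal size but different means.
x = [1.0, 2.0, 3.0]
y = [11.0, 12.0, 13.0]

# Average of the two group variances (the formula from this answer).
avg_of_vars = 0.5 * (pvariance(x) + pvariance(y))

# Variance of all values taken together.
pooled_var = pvariance(x + y)

# The gap is the variance of the group means around the overall mean.
mu = fmean(x + y)
between = 0.5 * ((fmean(x) - mu) ** 2 + (fmean(y) - mu) ** 2)

print(avg_of_vars, pooled_var, avg_of_vars + between)
```

When both groups are drawn with the same mean, as in the Octave test above, the between-group term is close to zero, which is why that test looks consistent.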

  • There are multiple errors in the first line that need to be corrected, such as the wrong denominator in the first formula and the disappearance of the $X_i$ for $i\gt N$ in the second formula. When you do that, could you explain what your symbols are intended to refer to and what assumptions you are making about them? For many possible interpretations your results are incorrect, so if you would like them to be understood as you intended, including such explanation is essential. – whuber Aug 29 '17 at 18:21
  • Hi, could you please suggest what else I should add? Formulas have been fixed, thanks! – Alex Nikiforov Aug 29 '17 at 18:27
  • I can only repeat myself: *explain your notation* and *tell us how you are interpreting* this (inherently ambiguous) question. BTW, the equalities are still incorrect: your first sum references $X_1, \ldots, X_{2N}$ whereas the sums after the equality reference only $X_1, \ldots, X_N$. What happened to $X_{N+1}, \ldots, X_{2N}$? What's the reason for putting a hat on "$\sigma$"? What are $\hat\sigma_1$ and $\hat\sigma_2$? What do you mean by "average"? What do the $X_i$ represent in the original question? How does this post address the question about *three* variances? – whuber Aug 29 '17 at 18:43
  • (Continued) Exactly what is "the mean" $\mu$? How is it related to the $X_i$? How is it related to the *three* means mentioned in the question? – whuber Aug 29 '17 at 18:44
  • What do you mean by "more correct"? I see that you are not using the notorious $-1$, while the other answer seems to. But the other answer also uses the Variance of the means, which confuses me. I am not sure how this relates to your formula. Can you explain? – Make42 Aug 30 '17 at 08:10
  • The other (more correct) answer estimates the variance properly. Since the estimated variance is a random variable, it needs to be treated as one. The -1 is just the difference between the biased and the unbiased estimate (please keep in mind that the question was about 3 estimates, so the -1 IS important because of the very limited number of samples). So, if you treat var(X) as a random variable, you need to estimate it as a random variable; look at the answer of Mr Tsjolder, it has all the needed details – Alex Nikiforov Aug 30 '17 at 08:42

Here I continue by simplifying Mr antonio irpino's answer, writing $\sum\limits_{i=1}^cn_i=n$:

$$\begin{align}n\cdot \sigma^2&=\sum\limits_{i=1}^c n_i(\mu_i-\mu)^2+\sum\limits_{i=1}^c n_i \sigma^2_i\\ &=\sum\limits_{i=1}^c{n_i[\sigma^2_i+(\mu_i-\mu)^2]}\\ &=\sum_{i=1}^c{n_i(\sigma_i^2+\mu_i^2-2\mu \mu_i +\mu^2)}\\ &=\sum_{i=1}^c{n_i(\sigma_i^2+\mu_i^2)}+\sum_{i=1}^c{n_i(\mu^2-2\mu \mu_i)}\\ &=\sum_{i=1}^c{n_i(\sigma_i^2+\mu_i^2)}+\mu^2\sum_{i=1}^c{n_i}-2\mu\sum_{i=1}^c{n_i \mu_i}\\ &=\sum_{i=1}^c{n_i(\sigma_i^2+\mu_i^2)}+n \mu^2 -2n \mu^2\\ &=\sum\limits_{i=1}^cn_i(\sigma^2_i+\mu^2_i)-n\mu^2 \end{align}$$

Conclusion:

This shows that the group size, the mean, and the variance all combine through sums with the same structure:

$$\begin{align}n&=\sum_{i=1}^c n_i\\ n\mu &=\sum\limits_{i=1}^{c}{n_i \mu_i}\\ n( \sigma ^2+\mu ^2 ) &=\sum_{i=1}^c{n_i( \sigma _i^2+\mu _i^2 )} \end{align}$$

I was surprised to be the one to find such a beautiful conclusion; at least I had not seen it anywhere else. It has something of the extreme beauty of Maxwell's system of equations. :)
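One way to read the three identities is that each group can be stored as the triple $(n_i,\ n_i\mu_i,\ n_i(\sigma_i^2+\mu_i^2))$, and merging groups is then plain addition. The following Python sketch illustrates this under that reading; the helper names `to_sums`, `from_sums`, and `merge` and the sample data are hypothetical, and the variances are population variances.

```python
from statistics import fmean, pvariance


def to_sums(n, mean, var):
    """Map (n, mu, sigma^2) to the additive triple (n, n*mu, n*(sigma^2 + mu^2))."""
    return (n, n * mean, n * (var + mean ** 2))


def from_sums(n, s1, s2):
    """Recover (n, mu, sigma^2) from an additive triple."""
    mean = s1 / n
    return n, mean, s2 / n - mean ** 2


def merge(*triples):
    """Merge groups by adding their triples component-wise."""
    return tuple(sum(parts) for parts in zip(*triples))


# Hypothetical groups; the sizes are allowed to differ here.
groups = [[1.0, 2.0, 3.0], [4.0, 6.0], [5.0, 5.0, 7.0, 9.0]]
summaries = [to_sums(len(g), fmean(g), pvariance(g)) for g in groups]

n, mean, var = from_sums(*merge(*summaries))

# Compare against the statistics of the pooled raw values.
pooled = [x for g in groups for x in g]
print(n, mean, fmean(pooled))
print(var, pvariance(pooled))
```

The subtraction `s2 / n - mean ** 2` can lose precision when the mean is large relative to the spread, so this is meant as an illustration of the identities rather than a numerically robust routine.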

yode