Is this a valid way to calculate an SD of a pooled sample (NOT a pooled SD)?

Question

Let's say I've got 2 different populations with different n (e.g. 35 and 40), means (22.3, 28.0) and SDs (3.2, 4.9). I don't actually know the individual values. Obviously you can calculate a pooled SD (NOT what I want) in R with something like this:

n <- c(35,40)
mean <- c(22.3,28.0)
sd <- c(3.2, 4.9)
df <- data.frame(n,mean,sd)

sqrt( sum(df$sd^2 * (df$n - 1)) / (sum(df$n - 1)) )

This gives an SD of around 4.194827.

However, would it be valid to assume these 2 populations are normal, synthesise them using the means and SDs given, and then take the SD of the 2 hypothetical populations combined? I don't see why not, as long as I'm explicit in a paper's methods that we've estimated the SD?

I could do this e.g. with:

> result <- NULL
> for (i in 1:10000) {
+   control <- rnorm(n = 35,mean = 22.3, sd=3.2)
+   active <- rnorm(n = 40,mean = 28, sd=4.9)
+   result <- c(result,sd(c(active,control)))
+ }
> mean(result)
[1] 5.06259

I iterated 10000 times to get a good average.

So

1) Is this a valid calculation?

2) If yes/no, is there a better way of doing it?

A colleague appears to have an Excel spreadsheet which gives a very similar number using this formula:

=SQRT((A2*(D2^2)+A2*((C2-G2)^2)+B2*(F2^2)+B2*((E2-G2)^2))/(A2+B2))

Where A and B are the n of the 1st and 2nd arms, C and D are the mean and SD of the 1st arm, and E and F are the mean and SD of the 2nd arm. G is the n-weighted mean of the 2 groups.
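For comparison with the R snippets above, here is a minimal sketch of the same spreadsheet calculation in R, using the summary statistics from the start of the question. The names `n`, `m`, `s`, `grand_mean` and `sd_combined` are only illustrative; `grand_mean` plays the role of G. Like the spreadsheet, it divides by the total n rather than n - 1.

```r
# Summary statistics from the question (no raw data needed)
n <- c(35, 40)       # group sizes
m <- c(22.3, 28.0)   # group means
s <- c(3.2, 4.9)     # group SDs

# G in the spreadsheet: the n-weighted mean of the two groups
grand_mean <- sum(n * m) / sum(n)

# Within-group part (n * sd^2) plus between-group part (n * (mean - G)^2),
# divided by the total n, as in the spreadsheet formula
sd_combined <- sqrt(sum(n * s^2 + n * (m - grand_mean)^2) / sum(n))
sd_combined
```

With these inputs it comes out at about 5.07, close to the simulated value.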

It seems to give the same result as my synthetic data so I assume it is valid?

Thanks

See numerous answers on site, such as [this](http://stats.stackexchange.com/questions/121107/is-there-a-name-or-reference-in-a-published-journal-book-for-the-following-varia) or [this](http://stats.stackexchange.com/questions/30495/how-to-combine-subsets-consisting-of-mean-variance-confidence-and-number-of-s) — Glen_b, Jan 11 '16 at 01:10

score 1 · Answer 1 · answered Jan 10 '16 at 20:46

I would describe the quantity you are looking for as the marginal or unconditional variance (not conditioning on the mean of each subgroup), as opposed to the conditional or average residual variance (conditioning on the mean of each subgroup). You can derive it theoretically using the law of total variance: $$ \begin{aligned} Var(X) & = E(Var(X|group)) + Var(E(X|group)) \\ & = \sum_{g \in group} P(g) Var(X_g) + \sum_{g \in group} P(g) \left(E(X_g) - E(X)\right)^2, \end{aligned} $$ where $E(X) = \sum_{g \in group} P(g) E(X_g)$ is the marginal mean and $P(g)$ is the probability of belonging to group $g$ (which you might estimate using the observed proportion of each group, $n_g / N$).
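As a rough check of the decomposition, the sketch below simulates one hypothetical data set with the group sizes, means and SDs from the question, then rebuilds the SD of the combined sample from per-group summaries only. It uses the finite-sample form of the identity (within-group plus between-group sums of squares, divided by N - 1) so that it matches R's `sd()`, which uses an n - 1 denominator. The seed and all variable names are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# One hypothetical data set with the group sizes from the question
control <- rnorm(35, mean = 22.3, sd = 3.2)
active  <- rnorm(40, mean = 28.0, sd = 4.9)

# Per-group summaries, as they might be reported in a paper
n <- c(length(control), length(active))
m <- c(mean(control), mean(active))
v <- c(var(control), var(active))

# Combined-sample variance from the summaries alone:
# within-group sum of squares plus between-group sum of squares
grand_mean <- sum(n * m) / sum(n)
ss_within  <- sum((n - 1) * v)
ss_between <- sum(n * (m - grand_mean)^2)
sd_from_summaries <- sqrt((ss_within + ss_between) / (sum(n) - 1))

# Direct calculation on the raw values, for comparison
sd_direct <- sd(c(control, active))

all.equal(sd_from_summaries, sd_direct)
```

The two values agree up to floating-point error for any data set, so no simulation or normality assumption is needed when the group n, means and SDs are known.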
