
Let's say I've got 2 different populations with different sample sizes (n = 35 and 40), means (22.3 and 28.0) and SDs (3.2 and 4.9). I don't actually know the individual values. Obviously you can calculate a pooled SD (NOT what I want) in R with something like this:

n <- c(35,40)
mean <- c(22.3,28.0)
sd <- c(3.2, 4.9)
df <- data.frame(n,mean,sd)

sqrt( sum(df$sd^2 * (df$n - 1)) / (sum(df$n - 1)) )

This gives an SD of around 4.194827.

However, would it be valid to assume these 2 populations are normal, synthesise them using the means and SDs given, and then take the SD of the 2 hypothetical populations combined? I don't see why not, as long as I'm explicit in a paper's methods that we've estimated the SD?

I could do this e.g. with:

> result <- NULL
> for (i in 1:10000) {
+   control <- rnorm(n = 35,mean = 22.3, sd=3.2)
+   active <- rnorm(n = 40,mean = 28, sd=4.9)
+   result <- c(result,sd(c(active,control)))
+ }
> mean(result)
[1] 5.06259

I iterated 10000 times to get a good average.

So

1) Is this a valid calculation?

2) If yes/no, is there a better way of doing it?

A colleague appears to have an excel spreadsheet which gives a very similar number using this formula:

=SQRT((A2*(D2^2)+A2*((C2-G2)^2)+B2*(F2^2)+B2*((E2-G2)^2))/(A2+B2))

Where A, C and D are the n, mean and SD of the 1st arm, and B, E and F are the same for the 2nd arm. G is the weighted mean of the 2 groups.

It seems to give the same result as my synthetic data so I assume it is valid?
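
For comparison, here is a rough R sketch of what the spreadsheet formula appears to compute. It assumes the cell layout implied by the formula itself (A/B = n, C/E = mean, D/F = SD, G = weighted mean); the function name `combined_sd` and its argument names are hypothetical, chosen only for illustration.

```r
# Hypothetical helper mirroring the spreadsheet formula (names are illustrative).
# n, m, s: vectors of group sizes, group means and group SDs.
combined_sd <- function(n, m, s) {
  grand_mean <- sum(n * m) / sum(n)        # cell G2: weighted mean of the arms
  within  <- sum(n * s^2)                  # A2*D2^2 + B2*F2^2
  between <- sum(n * (m - grand_mean)^2)   # A2*(C2-G2)^2 + B2*(E2-G2)^2
  sqrt((within + between) / sum(n))        # divides by total n, as the spreadsheet does
}

combined_sd(n = c(35, 40), m = c(22.3, 28.0), s = c(3.2, 4.9))
```

With the numbers above this comes out at about 5.07. Note that it divides by the total n; treating the reported SDs as sample SDs would use n - 1 weights instead, which changes the result only slightly.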

Thanks

James
  • See numerous answers on site, such as [this](http://stats.stackexchange.com/questions/121107/is-there-a-name-or-reference-in-a-published-journal-book-for-the-following-varia) or [this](http://stats.stackexchange.com/questions/30495/how-to-combine-subsets-consisting-of-mean-variance-confidence-and-number-of-s) – Glen_b Jan 11 '16 at 01:10

1 Answer


I would describe the quantity you are looking for as the marginal or unconditional variance (not conditioning on the mean of each subgroup), as opposed to the conditional or average residual variance (conditioning on the mean of each subgroup). You can derive it theoretically using the law of total variance: $$ \begin{align*} Var(X) & = E(Var(X|group)) + Var(E(X|group)) \\ & = \sum_{g \in group} P(g) Var(X_g) + \sum_{g \in group} P(g) \left(E(X_g) - E(X)\right)^2, \end{align*} $$ where $E(X) = \sum_{g \in group} P(g) E(X_g)$ is the marginal mean and $P(g)$ is the probability of belonging to group $g$ (which you might estimate using the observed proportion of each group).
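
As a rough check, here are the question's summary statistics plugged into the law of total variance. This assumes the group probabilities are estimated by the observed proportions $P(g) = n_g / N$ (so $35/75$ and $40/75$), and that each group's squared deviation from the marginal mean is weighted by $P(g)$:

$$ \begin{align*} E(X) & = \frac{35 \times 22.3 + 40 \times 28.0}{75} = 25.34 \\ E(Var(X|group)) & = \frac{35 \times 3.2^2 + 40 \times 4.9^2}{75} = 17.584 \\ Var(E(X|group)) & = \frac{35 \times (22.3 - 25.34)^2 + 40 \times (28.0 - 25.34)^2}{75} = 8.0864 \\ Var(X) & = 17.584 + 8.0864 = 25.6704, \qquad \sqrt{Var(X)} \approx 5.07 \end{align*} $$

This is close to the simulated value of about 5.06 in the question, and it is the same calculation as the spreadsheet formula, which also divides by the total $N$.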

Andrew M