Recover true statistics for a union of subsamples - only data available are summary statistics for each subsample

Question

Let $a_1$, $a_2$, ..., $a_m$ be samples of data, and let us further assume the only information we have about each sample is its size (number of observations), mean, standard deviation and median.

The task I have set myself is to recover the true values, or at least the best possible estimates, of the mean, median and standard deviation of the union of these samples $a_1 \cup a_2 \cup ... \cup a_m$, which I will call $A$.

Recover the Mean

Recovering the mean is straightforward: weighting each subsample mean by its number of observations gives the mean of $A$ exactly.

$$\bar{x}_1 = \frac{1}{n_1} \sum\limits_{i=1}^{n_1} x_i, \quad \bar{x}_2 = \frac{1}{n_2} \sum\limits_{i=1}^{n_2} x_i, \quad ..., \quad \bar{x}_m = \frac{1}{n_m} \sum\limits_{i=1}^{n_m} x_i$$

$$ \bar{x}_A = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + ... + n_m\bar{x}_m}{n_1 + n_2 + ... + n_m} $$
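A minimal sketch of the weighted-mean formula above in Python. The function name `combined_mean` and the toy numbers are illustrative assumptions, not part of the original question.

```python
def combined_mean(counts, means):
    """Mean of the union A: subsample means weighted by subsample sizes."""
    total = sum(counts)
    return sum(n * m for n, m in zip(counts, means)) / total


# Hypothetical example: subsamples [1, 2, 3] and [4, 5] summarised by
# their sizes and means only.
counts = [3, 2]
means = [2.0, 4.5]
print(combined_mean(counts, means))
```

The result is exact (up to floating-point rounding), because $n_i \bar{x}_i$ recovers each subsample's sum.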

Recover the Standard Deviation

This seems like it should be possible.

The standard deviation of a particular sample is defined as: $$ s_i = \sqrt{\frac{\sum\limits_{k=1}^{n_i} (x_k - \bar{x}_i)^2}{n_i-1}}$$

It seems to me, we could do the following to attempt to recover the standard deviation of $A$. Essentially we could assume a symmetric deviation about the subsample mean for each data point, half below, half above, and calculate the new whole sample standard deviation using the difference between the whole sample mean $\bar{x}_A$ and each subsample mean $\bar{x}_i$.

For a particular sample, say $a_i$, let us assume one-half of the data points are below the sample mean, and one-half are above the sample mean.

Because we can recover the mean of $A$ from the data, we can now calculate the difference between the mean of $A$ and each subsample mean. This can then be used to attempt a recovery of the standard deviation of $A$.

Let $d_i = \bar{x}_i - \bar{x}_A$ be the difference between the mean of subsample $a_i$ and the overall sample mean $\bar{x}_A$, and let $s_i$ be the subsample standard deviation. Then

$$ s_A = \sqrt{\frac{\sum\limits_{i=1}^{m} \left[ \frac{1}{2}n_i(d_i + s_i)^2 + \frac{1}{2}n_i(d_i - s_i)^2 \right]}{n_1 + n_2 + ... + n_m - 1}} $$
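As a check on this idea, here is a hedged Python sketch comparing the half-above/half-below formula with the standard sum-of-squares decomposition, $(N-1)s_A^2 = \sum_i (n_i - 1)s_i^2 + \sum_i n_i(\bar{x}_i - \bar{x}_A)^2$ with $N = \sum_i n_i$, which is exact when each $s_i$ uses the $n_i - 1$ denominator. Expanding the heuristic gives $\sum_i n_i(d_i^2 + s_i^2)$ in the numerator, so it differs only in weighting $s_i^2$ by $n_i$ rather than $n_i - 1$. The function names and toy data are assumptions made for illustration.

```python
import math
import statistics


def combined_sd(counts, means, sds):
    """Exact sample SD of the union from per-subsample n, mean and sample SD."""
    total = sum(counts)
    grand_mean = sum(n * m for n, m in zip(counts, means)) / total
    # Within-subsample sum of squares: (n_i - 1) * s_i^2 undoes the n_i - 1 divisor.
    within = sum((n - 1) * s ** 2 for n, s in zip(counts, sds))
    # Between-subsample sum of squares around the overall mean.
    between = sum(n * (m - grand_mean) ** 2 for n, m in zip(counts, means))
    return math.sqrt((within + between) / (total - 1))


def heuristic_sd(counts, means, sds):
    """The half-above/half-below formula from the question, for comparison."""
    total = sum(counts)
    grand_mean = sum(n * m for n, m in zip(counts, means)) / total
    num = 0.0
    for n, m, s in zip(counts, means, sds):
        d = m - grand_mean
        num += 0.5 * n * (d + s) ** 2 + 0.5 * n * (d - s) ** 2
    return math.sqrt(num / (total - 1))


# Hypothetical raw data, used only to verify against a direct calculation.
a1 = [1.0, 2.0, 3.0]
a2 = [4.0, 5.0]
counts = [len(a1), len(a2)]
means = [statistics.mean(a1), statistics.mean(a2)]
sds = [statistics.stdev(a1), statistics.stdev(a2)]

exact = combined_sd(counts, means, sds)
approx = heuristic_sd(counts, means, sds)
direct = statistics.stdev(a1 + a2)
print(exact, approx, direct)
```

On this toy data the decomposition matches the direct calculation, while the heuristic comes out somewhat larger; the gap shrinks as the subsample sizes grow.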

Recover the Median

I see no straightforward way to recover this accurately. We do have an idea of the dispersion and of the difference between the mean and the median for each sample, so I have glimmers of possibilities, but I have not thought about it deeply and cannot see an obvious path.
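One possible approximation, sketched below in Python, is to treat each subsample as normally distributed with its reported mean and standard deviation and take the median of the size-weighted mixture. The normality assumption is a strong one introduced here for illustration only, and the result is an estimate rather than the true median. The sketch also uses the fact that the median of the union cannot lie outside the range of the subsample medians, so the estimate is clamped to that range. The function names are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution with mean mu and SD sigma (sigma > 0)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def mixture_median(counts, means, sds, medians, tol=1e-10):
    """Approximate median of the union, ASSUMING each subsample is normal."""
    total = sum(counts)

    def cdf(x):
        # Size-weighted mixture CDF of the assumed normal components.
        return sum(n * normal_cdf(x, m, s)
                   for n, m, s in zip(counts, means, sds)) / total

    # Bracket the root of cdf(x) = 0.5 well outside every component.
    lo = min(m - 10.0 * s for m, s in zip(means, sds))
    hi = max(m + 10.0 * s for m, s in zip(means, sds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    estimate = 0.5 * (lo + hi)
    # The union's median lies between the smallest and largest subsample median.
    return min(max(estimate, min(medians)), max(medians))


# Hypothetical summary statistics for two subsamples.
est = mixture_median(counts=[10, 10], means=[0.0, 2.0],
                     sds=[1.0, 1.0], medians=[0.0, 2.0])
print(est)
```

Bisection is used because the mixture CDF is monotone, so no external library is needed; a skewed subsample (mean far from its median) is a sign that the normal assumption is poor for that subsample.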

My Question for Cross Validated

Can anyone comment on these strategies, offer their expertise, or point me to some resources?

I believe all these questions have been answered here, so one resource is our site search. — whuber, Jul 25 '15 at 21:05
I did a quick search but mainly hit on examples where histograms were available. I struggled to find anything that fits my scenario? — stats_novice_123, Jul 25 '15 at 21:08
http://stats.stackexchange.com/questions/43159, http://stats.stackexchange.com/questions/30495, http://stats.stackexchange.com/questions/12251, http://stats.stackexchange.com/questions/151947, etc. Use the name of a statistic, such as "median," "quantile," or "variance" in a search along with a keyword like "combine" or "pool". I'm getting hundreds of hits--not all of which are appropriate, of course, but many look useful. — whuber, Jul 25 '15 at 21:12
thank you whuber, I have found a lovely resource here: http://www.burtonsys.com/climate/composite_standard_deviations.html I will perhaps post an answer for the standard deviation — stats_novice_123, Jul 25 '15 at 21:31

score 0 · Answer 1 · answered May 04 '21 at 05:00

Mean: You have already answered this yourself.
Standard deviation: First find a pooled variance; see "How to calculate pooled variance of two or more groups given known group variances, means, and sample sizes?"
Median: This is more difficult; see "Is it possible to calculate Q1, Median, Q3, StDev from already aggregated data?", "Weighted median", and "Calculate one median for data from five experimental repetitions".
