
Let $a_1$, $a_2$, ..., $a_m$ be samples of data, and let us further assume that the only information we have about each sample is its size (number of observations), mean, standard deviation and median.

The task I have set myself is to recover the true values, or at least the best possible estimates, of the mean, median and standard deviation of the pooled sample $a_1 \cup a_2 \cup ... \cup a_m$, which I will call $A$.

Recover the Mean

Recovering the mean is straightforward: the mean of $A$ is the average of the subsample means, weighted by the number of observations $n_i$ in each subsample.

$$\bar{x}_1 = \frac{1}{n_1} \sum_{i=1}^{n_1} x_i, \quad \bar{x}_2 = \frac{1}{n_2} \sum_{i=1}^{n_2} x_i, \quad \ldots, \quad \bar{x}_m = \frac{1}{n_m} \sum_{i=1}^{n_m} x_i$$

$$ \bar{x}_A = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + ... + n_m\bar{x}_m}{n_1 + n_2 + ... + n_m} $$
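The weighted average above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `pooled_mean` and the example data are assumptions introduced here, and the raw values are used only to check the result against a direct calculation.

```python
import math


def pooled_mean(counts, means):
    """Weighted average of subsample means, weighted by subsample sizes."""
    total = sum(counts)
    if total == 0:
        raise ValueError("total sample size must be positive")
    return sum(n * m for n, m in zip(counts, means)) / total


# Hypothetical raw data, used only to verify the formula.
samples = [[1, 2, 3], [10, 20]]
counts = [len(s) for s in samples]
means = [sum(s) / len(s) for s in samples]

combined = pooled_mean(counts, means)
direct = sum(x for s in samples for x in s) / sum(counts)
print(combined, direct)
assert math.isclose(combined, direct)
```

Only the counts and means enter `pooled_mean`, so the pooled mean is recovered exactly, not estimated.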

Recover the Standard Deviation

This seems like it should be possible.

The standard deviation of a particular sample is defined as: $$ s_i = \sqrt{\frac{\sum_{k=1}^{n_i} (x_k - \bar{x}_i)^2}{n_i-1}}$$

It seems to me that we could attempt to recover the standard deviation of $A$ as follows. Assume that the data points in each subsample deviate symmetrically about the subsample mean, half below and half above, and then calculate the standard deviation of the whole sample using the difference between the whole-sample mean $\bar{x}_A$ and each subsample mean $\bar{x}_i$.

For a particular sample, say $a_i$, let us assume one-half of the data points are below the sample mean, and one-half are above the sample mean.

Because we can recover the mean of $A$ from the data, we can use it to calculate the difference between the mean of $A$ and each subsample mean. This can then be used to attempt a recovery of the standard deviation of $A$.

Let $d_i = \bar{x}_i - \bar{x}_A$ be the difference between the mean of subsample $a_i$ and the overall sample mean $\bar{x}_A$, and let $s_i$ be the subsample standard deviation. Then

$$ s_A = \sqrt{\frac{\sum_{i=1}^{m} \left[ \frac{1}{2}n_i(d_i + s_i)^2 + \frac{1}{2}n_i(d_i - s_i)^2 \right]}{n_1 + n_2 + ... + n_m - 1}} $$
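One way to test this proposal is to compare it with the standard sum-of-squares decomposition on data where the raw values are known. Expanding the squares shows that each pair of terms in the proposed formula reduces to $n_i(d_i^2 + s_i^2)$, whereas the exact identity $\sum_{x \in A}(x - \bar{x}_A)^2 = \sum_i \left[(n_i - 1)s_i^2 + n_i d_i^2\right]$ weights $s_i^2$ by $n_i - 1$. The sketch below computes both, with $d_i$ taken as the subsample mean minus the pooled mean. The function names and the example data are assumptions made for illustration only.

```python
import math
import statistics


def _grand_mean(counts, means):
    return sum(n * m for n, m in zip(counts, means)) / sum(counts)


def pooled_sd_symmetric(counts, means, sds):
    """The proposal above: half of each subsample at mean - s, half at mean + s."""
    grand = _grand_mean(counts, means)
    total = 0.0
    for n, m, s in zip(counts, means, sds):
        d = m - grand
        total += 0.5 * n * (d + s) ** 2 + 0.5 * n * (d - s) ** 2
    return math.sqrt(total / (sum(counts) - 1))


def pooled_sd_exact(counts, means, sds):
    """Within-sample plus between-sample sums of squares, divided by N - 1."""
    grand = _grand_mean(counts, means)
    within = sum((n - 1) * s ** 2 for n, s in zip(counts, sds))
    between = sum(n * (m - grand) ** 2 for n, m in zip(counts, means))
    return math.sqrt((within + between) / (sum(counts) - 1))


# Hypothetical raw data, used only to check both formulas.
samples = [[1, 2, 3, 4], [10, 12, 14]]
counts = [len(s) for s in samples]
means = [statistics.mean(s) for s in samples]
sds = [statistics.stdev(s) for s in samples]

direct = statistics.stdev([x for s in samples for x in s])
exact = pooled_sd_exact(counts, means, sds)
symmetric = pooled_sd_symmetric(counts, means, sds)
print(direct, exact, symmetric)
```

On this example the decomposition matches the directly computed standard deviation, while the symmetric version comes out slightly larger, because it uses $n_i s_i^2$ where the identity has $(n_i - 1) s_i^2$. The gap shrinks as the subsamples grow.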

Recover the Median

I see no straightforward way to do this accurately. We do have an idea of the dispersion, and of the difference between the mean and the median, for each sample, so I have glimmers of possibilities, but I have not thought about it deeply and cannot see an obvious path.
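A small counterexample suggests that the pooled median cannot, in general, be recovered exactly from these four summaries. In the sketch below, two hypothetical subsamples share the same size, mean, standard deviation and median, yet pooling each of them with the same third subsample gives different medians. The data are constructed purely for illustration.

```python
import math
import statistics


def summary(sample):
    """Return (size, mean, standard deviation, median) of a sample."""
    return (
        len(sample),
        statistics.mean(sample),
        statistics.stdev(sample),
        statistics.median(sample),
    )


# Two hypothetical subsamples with identical summaries.
first_a = [0, 5, 5, 10]
first_b = [1, 2, 8, 9]
# A second subsample that is pooled with each of them in turn.
second = [2, 3, 4]

same = all(
    math.isclose(p, q) for p, q in zip(summary(first_a), summary(first_b))
)
median_a = statistics.median(first_a + second)
median_b = statistics.median(first_b + second)
print(same, median_a, median_b)
```

Here the summaries agree, but the pooled medians are 4 and 3, so any rule based only on count, mean, standard deviation and median can at best estimate the pooled median.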

My Question for Cross Validated

Can anyone comment on these strategies, offer their expertise, or point me to some resources?

kjetil b halvorsen
  • I believe all these questions have been answered here, so one resource is our site search. – whuber Jul 25 '15 at 21:05
  • I did a quick search but mainly hit on examples where histograms were available. I struggled to find anything that fits my scenario? – stats_novice_123 Jul 25 '15 at 21:08
  • http://stats.stackexchange.com/questions/43159, http://stats.stackexchange.com/questions/30495, http://stats.stackexchange.com/questions/12251, http://stats.stackexchange.com/questions/151947, etc. Use the name of a statistic, such as "median," "quantile," or "variance" in a search along with a keyword like "combine" or "pool". I'm getting hundreds of hits--not all of which are appropriate, of course, but many look useful. – whuber Jul 25 '15 at 21:12
  • thank you whuber, I have found a lovely resource here: http://www.burtonsys.com/climate/composite_standard_deviations.html I will perhaps post an answer for the standard deviation – stats_novice_123 Jul 25 '15 at 21:31