I have frequently been faced with analyzing CSV files that are limited to approximately 32,000 records apiece (the record count overflows at 32,768, i.e. the signed-integer limit of a 16-bit system).
As a result, I often have data that spans several hundred files, each containing up to 32,000 records. Even that 32,000-record length is variable and effectively random, because the automated data collection halts whenever the process being monitored stops.
Here is the issue -- I can very easily open these two hundred plus files algorithmically and sequentially determine how many data points each contains, along with some desired summary statistics. The statistics of interest are each file's sample size, mean, and standard deviation.
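For concreteness, here is roughly how I do that per-file pass (a minimal Python sketch; the glob pattern and the "value" column name are placeholders for my actual file layout):

```python
# Per-file pass: collect (n, mean, stdev) for each CSV.
# Assumes a single numeric column named "value"; adjust to the real layout.
import csv
import glob
import statistics

per_file_stats = []  # list of (n, mean, stdev) tuples, one per file

for path in sorted(glob.glob("data_*.csv")):
    with open(path, newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]
    n = len(values)
    if n >= 2:  # statistics.stdev needs at least two points
        per_file_stats.append((n, statistics.mean(values), statistics.stdev(values)))
```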
Armed with these two-hundred-some-odd sets of summary statistics, I wish to recover the mean and standard deviation of the entire collected sample (up to 200+ files times a maximum of 32,000 points each). It seems to me that moments about the mean are somehow involved, in that the standard deviation of a hypothetical subset of, say, 16,000 data points should carry less weight (and therefore have less influence on the overall standard deviation) than that of a subset of 32,000 points. For that reason, I suspect that the variances of the individual subsets play a role, or that some root-mean-square procedure can recombine these summary statistics, suitably weighted by sample size, to recover the mean and standard deviation of the entire sample set. My current guess at such a weighting is sketched below; your assistance in confirming or correcting it is appreciated.
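Here is that guess, continuing from the per-file pass above. The step where I rebuild each file's sum of squares from its (n, mean, stdev) is my own assumption and is exactly what I would like checked:

```python
# Suspected weighted combination (unverified): rebuild each file's sum and
# sum of squares from its summary, pool them, then re-derive the overall
# mean and standard deviation.  statistics.stdev above uses the n-1
# (sample) convention, so that is undone here.
import math

N = sum(n for n, _, _ in per_file_stats)
overall_mean = sum(n * m for n, m, _ in per_file_stats) / N

# Sum of squared values in one file, recovered from n, mean m, stdev s:
#   sum(x^2) = (n - 1) * s^2 + n * m^2
total_sq = sum((n - 1) * s**2 + n * m**2 for n, m, s in per_file_stats)
overall_stdev = math.sqrt((total_sq - N * overall_mean**2) / (N - 1))

print(N, overall_mean, overall_stdev)
```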