How to approximate standard deviation with summarized stats

Question

I'm trying to summarize incoming data to speed up some anomaly detection queries, but I'm not sure which of the summarized stats I need to retain or how to use them afterwards.

Here's my basic goal:

About 1 million events are rolled up into each 1-minute bucket with stats (count, avg, sum, stdev, variance, min, max, etc.). Then I have a query that retrieves the 1-minute buckets from the last 10 Mondays, 9:00am - 9:05am.

That reduces things down to about 50 rows of statistics data. But how do I transform these summarized stats into an approximation of the standard deviation, as if I had calculated it from the full 1 million x 50 events?

So that we have something to work with, let's use the example below (calculations done with Excel formulas).

raw events

value, time (hh:mm:ss)
    0, 9:00:00
    2, 9:00:20
    1, 9:00:40
    1, 9:01:00
    3, 9:01:30

summarized by minute

minute, count, sum, mean, stdev.p, variance.p (others needed?)
  9:00,     3,   3,    1,  0.8165,     0.6667
  9:01,     2,   4,    2,       1,          1

goal: recreate this from "summarized by minute" data

stdev.s( 0, 2, 1, 3, 1 ) = 1.1402
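
As a worked check on the example (a sketch, not part of the original post), the target value can be reached from the two summary rows alone, using only count, mean, and variance.p. The symbols $n_i$, $\bar{x}_i$, and $\sigma_i^2$ (per-bucket count, mean, and population variance) and $SS$ (total sum of squared deviations) are notation introduced here for illustration:

```latex
\begin{aligned}
N &= 3 + 2 = 5, \qquad \bar{x} = \frac{3 + 4}{5} = 1.4 \\
SS &= \sum_i n_i \left( \sigma_i^2 + (\bar{x}_i - \bar{x})^2 \right)
    = 3\,(0.6667 + 0.16) + 2\,(1 + 0.36) = 5.2 \\
s &= \sqrt{\frac{SS}{N - 1}} = \sqrt{\frac{5.2}{4}} \approx 1.1402
\end{aligned}
```

Under these definitions the result is exact rather than an approximation, apart from rounding in the stored values.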

This is known as ["online" estimation](http://stats.stackexchange.com/questions/235129/online-estimation-of-variance-with-limited-memory). The linked answer should give you what you need. (This may be close enough to count as a duplicate?) — GeoMatt22, Dec 12 '16 at 23:28
The "online" version would be especially suitable if you're collecting the information "as you go" (though it can be used on data after the fact just fine). If you have all the means and variances in hand at the time of calculation, it would be a duplicate of numerous other posts on site. The formulas are derived [here](http://stats.stackexchange.com/questions/121107/is-there-a-name-or-reference-in-a-published-journal-book-for-the-following-varia) for example. See also [here](http://stats.stackexchange.com/questions/10441/how-to-calculate-the-variance-of-a-partition-of-variables)... ctd — Glen_b, Dec 12 '16 at 23:31
ctd... the n-denominator version of variance is done [here](http://stats.stackexchange.com/questions/43159/how-to-calculate-pooled-variance-of-two-groups-given-known-group-variances-mean) (though it's easily adapted to the Bessel-corrected case). Both on-line and off-line calculations are discussed [here](http://stats.stackexchange.com/questions/216047/how-does-one-go-about-determining-the-standard-deviation-of-an-entire-sample-dat/216060#216060). To my recollection there are a couple of others on site but I think these will cover pretty much every aspect of the problem you will care about. — Glen_b, Dec 12 '16 at 23:36
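
The "off-line" combination the comments refer to can be sketched as follows. This is a minimal illustration rather than code from any of the linked answers; the function name `combine_buckets` and the tuple layout are hypothetical. It assumes each bucket stores the count, the mean, and the population variance (Excel's variance.p). If the buckets hold the sample variance instead, each bucket's within-bucket term would be `(n - 1) * var_s` rather than `n * var_p`.

```python
import math


def combine_buckets(buckets, sample=True):
    """Pool per-bucket (count, mean, population_variance) rows into one std dev.

    `buckets` is an iterable of (n, mean, var_p) tuples (hypothetical layout).
    With sample=True the result uses the n-1 (Bessel-corrected) denominator,
    matching Excel's stdev.s; with sample=False it matches stdev.p.
    """
    buckets = list(buckets)
    total_n = sum(n for n, _, _ in buckets)
    grand_mean = sum(n * mean for n, mean, _ in buckets) / total_n

    # Total sum of squared deviations from the grand mean: each bucket
    # contributes its internal spread plus the squared offset of its own
    # mean from the grand mean, weighted by its count.
    ss = sum(n * (var_p + (mean - grand_mean) ** 2) for n, mean, var_p in buckets)

    denom = total_n - 1 if sample else total_n
    return math.sqrt(ss / denom)


# The two summary rows from the question: 9:00 and 9:01.
rows = [(3, 1.0, 2.0 / 3.0), (2, 2.0, 1.0)]
print(round(combine_buckets(rows), 4))  # 1.1402
```

Only count, mean, and variance are needed per bucket; min and max do not enter the calculation. For very large counts, storing the mean and variance (rather than a raw sum of squares) tends to be the numerically safer choice.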
