I'm trying to summarize incoming data to speed up some anomaly detection queries, but I'm not sure how to retain and use the statistics I'm summarizing.

Here's my basic goal:

Each 1-minute bucket summarizes about 1 million events with stats (count, avg, sum, stdev, variance, min, max, etc.). Then I have a query that retrieves the 1-minute buckets from the last 10 Mondays, 9:00am - 9:05am.

That reduces things down to about 50 rows of summary statistics (10 Mondays x 5 minutes). But how do I combine these summarized stats into an approximation of the standard deviation, as if I had calculated it from all 1 million x 50 raw events?

So we have something to work with, let's use the example below (calculated with Excel formulas).

raw events

value, time (hh:mm:ss)
    0, 9:00:00
    2, 9:00:20
    1, 9:00:40
    1, 9:01:00
    3, 9:01:30

summarized by minute

minute, count, sum, mean, stdev.p, variance.p .. others needed?
  9:00,     3,   3,    1,  0.8165,     0.6667
  9:01,     2,   4,    2,       1,          1

goal: recreate this from "summarized by minute" data

stdev.s( 0, 2, 1, 3, 1 ) = 1.1402
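
For reference, here is a minimal sketch of one way to get that number from the "summarized by minute" rows alone. It assumes each bucket keeps its count, mean, and population variance (the `variance.p` column); the function name `combine_buckets` is made up for illustration. The idea is that the sum of squared deviations around the combined mean splits into a within-bucket part (`n * variance.p`) and a between-bucket part (`n * (mean - combined_mean)^2`), so sum, min, and max are not needed for this.

```python
import math

def combine_buckets(buckets, sample=True):
    """Combine (count, mean, population_variance) rows into one standard deviation.

    Hypothetical helper: `buckets` is a list of per-minute summary rows.
    """
    total_n = sum(n for n, _, _ in buckets)
    # Count-weighted mean of the bucket means equals the mean of all raw events.
    grand_mean = sum(n * mean for n, mean, _ in buckets) / total_n
    # Within-bucket spread plus the spread of each bucket mean around the grand mean.
    ss = sum(n * (var_p + (mean - grand_mean) ** 2) for n, mean, var_p in buckets)
    # sample=True mimics stdev.s (divide by N - 1); sample=False mimics stdev.p.
    denom = total_n - 1 if sample else total_n
    return math.sqrt(ss / denom)

# (count, mean, variance.p) taken from the "summarized by minute" table
summary = [(3, 1.0, 0.6667), (2, 2.0, 1.0)]
print(round(combine_buckets(summary), 4))  # prints 1.1402
```

Up to rounding in the stored variances, this is exact rather than an approximation, as long as the stored variance is the population (`.p`) version. If the buckets stored `variance.s` instead, the within-bucket term would be `(n - 1) * variance.s`.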
Kit
  • This is known as ["online" estimation](http://stats.stackexchange.com/questions/235129/online-estimation-of-variance-with-limited-memory). The linked answer should give you what you need. (This may be close enough to count as a duplicate?) – GeoMatt22 Dec 12 '16 at 23:28
  • The "online" version would be especially suitable if you're collecting the information "as you go" (though it can be used on data after the fact just fine). If you have all the means and variances in hand at the time of calculation, it would be a duplicate of numerous other posts on site. The formulas are derived [here](http://stats.stackexchange.com/questions/121107/is-there-a-name-or-reference-in-a-published-journal-book-for-the-following-varia) for example. See also [here](http://stats.stackexchange.com/questions/10441/how-to-calculate-the-variance-of-a-partition-of-variables)... ctd – Glen_b Dec 12 '16 at 23:31
  • ctd... the n-denominator version of variance is done [here](http://stats.stackexchange.com/questions/43159/how-to-calculate-pooled-variance-of-two-groups-given-known-group-variances-mean) (though it's easily adapted to the Bessel-corrected case). Both on-line and off-line calculations are discussed [here](http://stats.stackexchange.com/questions/216047/how-does-one-go-about-determining-the-standard-deviation-of-an-entire-sample-dat/216060#216060). To my recollection there are a couple of others on site but I think these will cover pretty much every aspect of the problem you will care about. – Glen_b Dec 12 '16 at 23:36
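
As a rough sketch of the "online" approach mentioned in the comments: if each summary is kept as `(count, mean, M2)`, where `M2` is the sum of squared deviations from that summary's own mean, two summaries can be merged without going back to the raw events. This is the pairwise update usually attributed to Chan et al. The names `merge` and `from_row` are hypothetical, and the rows are assumed to carry the population variance (`variance.p`).

```python
import math

def merge(a, b):
    """Merge two (count, mean, M2) summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return (0, 0.0, 0.0)
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term accounts for the gap between the two means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return (n, mean, m2)

def from_row(count, mean, var_p):
    """Convert a stored (count, mean, variance.p) row into (count, mean, M2)."""
    return (count, mean, count * var_p)

running = (0, 0.0, 0.0)  # empty summary
for row in [(3, 1.0, 0.6667), (2, 2.0, 1.0)]:
    running = merge(running, from_row(*row))

n, mean, m2 = running
print(n, round(mean, 4), round(math.sqrt(m2 / (n - 1)), 4))  # prints 5 1.4 1.1402
```

Because the merge works on any two summaries, buckets can be folded in one at a time as they arrive, or minute rows can be rolled up into coarser summaries later.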
