I'm trying to summarize incoming data to speed up some anomaly detection queries... but I'm not sure how to retain/use the stats I'm summarizing.
Here's my basic goal:
1 million events get rolled up into 1-minute buckets with stats (count, avg, sum, stdev, variance, min, max, etc.). Then I have a query that retrieves the 1-minute buckets from 9:00am-9:05am on each of the last 10 Mondays.
That reduces things to about 50 rows of statistics. But how do I combine those summarized stats into the standard deviation I would have gotten by calculating it directly over all of the raw events behind those 50 buckets?
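From what I've read, the algebra should be something like the following; this is my own sketch, assuming each bucket i stores its count n_i, mean m_i, and population variance v_i, so please correct me if it's off:

```latex
N = \sum_i n_i, \qquad
\bar{x} = \frac{\sum_i n_i m_i}{N}, \qquad
\sum x^2 = \sum_i n_i \left(v_i + m_i^2\right), \qquad
s^2 = \frac{\sum x^2 - N\,\bar{x}^2}{N - 1}
```

The idea being that v_i + m_i^2 recovers each bucket's mean of squares, so the total sum of squares (and from it the overall sample variance) can be rebuilt without the raw events.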
So that we have something to work with, let's use the example below (calculations done with Excel formulas).
raw events
value, time (hh:mm:ss)
0, 9:00:00
2, 9:00:20
1, 9:00:40
3, 9:01:00
1, 9:01:30
summarized by minute
minute, count, sum, mean, stdev.p, variance.p (others needed?)
9:00, 3, 3, 1, 0.8165, 0.6667
9:01, 2, 4, 2, 1, 1
goal: recreate this from the "summarized by minute" data:
stdev.s( 0, 2, 1, 3, 1 ) = 1.1402
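And here's a quick Python sketch of what I think the recombination looks like, checked against the example above (the bucket tuples are just the rows from the summary table):

```python
import math

# (count, mean, population variance) per minute bucket, from the table above
buckets = [
    (3, 1.0, 2.0 / 3.0),  # 9:00 -> raw values 0, 2, 1
    (2, 2.0, 1.0),        # 9:01 -> raw values 3, 1
]

n_total = sum(n for n, _, _ in buckets)                    # 5
grand_mean = sum(n * m for n, m, _ in buckets) / n_total   # 1.4

# var.p = E[x^2] - mean^2, so each bucket's sum of squares is n * (var + mean^2)
sum_sq = sum(n * (v + m * m) for n, m, v in buckets)       # 15.0

# overall sample variance, i.e. what STDEV.S computes over the raw events
sample_var = (sum_sq - n_total * grand_mean ** 2) / (n_total - 1)
print(round(math.sqrt(sample_var), 4))                     # 1.1402
```

If that's right, then count, sum (or mean), and variance per bucket are enough, and min/max etc. aren't needed for this part. But I'd like someone to confirm this holds in general and not just for this small example.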