Long story short, I have a collection of about 30 scripts that manipulate data sets and place them in a database. These scripts report their running times, as well as any errors that occur, to a separate database. I wrote another script that goes through this database daily and, for each script, determines whether an error occurred. It also pulls each script's running times from the past 30 days and averages them.
I take the running time of the current script and check whether it is more than 3 standard deviations above the average. If it is, I report that the running time is too far from the average.
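For clarity, here is roughly what the check does (a minimal Python sketch; the function and variable names are mine, not copied from my actual scripts):

```python
import statistics

def check_runtime(current_runtime, past_runtimes):
    """Flag the current run if it is more than 3 standard deviations
    above the mean of the past runs (here, the last 30 days)."""
    mean = statistics.mean(past_runtimes)
    stdev = statistics.stdev(past_runtimes)  # sample standard deviation
    if current_runtime > mean + 3 * stdev:
        return "running time too far from average"
    return None
```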
Is this the correct method for performing such a task? I feel as if I get far too many "running time too far from average" errors. Would increasing the sample size help, or does the 3-standard-deviation rule simply not apply here? I was under the assumption that about 99% of data lies within 3 standard deviations of the mean, so a reliable way to detect outliers (a script that took a "hella long time" to run) would be to use this method.