I'm writing a script that analyses run times of processes. I am not sure of their distribution, but I want to know if a process runs "too long". So far I've been flagging anything more than 3 standard deviations above the mean of the last run times (n > 30), but I was told that this does not provide anything useful if the data is not normal (which mine does not appear to be). I found another outlier test that states:
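For reference, my current check is roughly like this (Python, just a sketch of the three-standard-deviation rule, not my exact code):

    import statistics

    def too_long_sigma(history, new_time, k=3):
        """Flag new_time if it is more than k standard deviations above the mean of history."""
        mean = statistics.mean(history)
        sd = statistics.stdev(history)
        return new_time > mean + k * sd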
Find the interquartile range, which is IQR = Q3 - Q1, where Q3 is the third quartile and Q1 is the first quartile. Then find these two numbers:
a) Q1 - 1.5*IQR b) Q3 + 1.5*IQR
A point is an outlier if it is less than a or greater than b.
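If I understand it correctly, that would translate to something like the following (a sketch, assuming Python 3.8+ for statistics.quantiles):

    import statistics

    def is_outlier_iqr(history, new_time, k=1.5):
        """Flag new_time if it falls outside the fences Q1 - k*IQR and Q3 + k*IQR."""
        q1, _, q3 = statistics.quantiles(history, n=4)  # quartile cut points
        iqr = q3 - q1
        return new_time < q1 - k * iqr or new_time > q3 + k * iqr

    # Example with data like mine:
    history = [2, 3, 2, 5, 4, 3, 2, 6, 3, 4]
    print(is_outlier_iqr(history, 300))  # True
    print(is_outlier_iqr(history, 5))    # False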
My data tends to be things like 2 sec, 3 sec, 2 sec, 5 sec, 300 sec, 4 sec, ..., where 300 sec is obviously an outlier.
Which method is better: the IQR method or the standard deviation method?