7

I'm a programmer with only a small statistics background, and I need to find outliers in a small list of integers and floats.

After some searching on Google I found the Iglewicz and Hoaglin outlier test, which computes a modified z-score $M_i$ for every value in the list and checks it against a threshold (normally 3.5).

$$M_{i} = \frac{0.6745(x_{i} - \tilde{x})} {\mbox{MAD}}$$

I wrote a little Python script to test it. At first it worked great, but after a few tests I spotted a problem.
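
In essence the test boils down to something like this (a simplified sketch using NumPy, not my exact script; find_outliers is just an illustrative name):

    import numpy as np

    def find_outliers(data, threshold=3.5):
        # Iglewicz-Hoaglin: flag values whose |modified z-score| exceeds the threshold
        x = np.asarray(data, dtype=float)
        median = np.median(x)
        mad = np.median(np.abs(x - median))   # median absolute deviation
        m = 0.6745 * (x - median) / mad       # modified z-scores
        return [v for v, score in zip(data, np.abs(m)) if score > threshold]

    print(find_outliers([1, 2, 2, 3, 2, 100]))  # -> [100]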

If you try to find outliers (with my script) in a list with many identical values and one outlier, e.g. data = [10, 10, 10, 10, 10, 10, 10, 100], the MAD (median absolute deviation) becomes 0, and this leads me to my question: "What should I do if the MAD becomes 0?"

My first idea was to set the MAD to a huge number (maxint) in that case, but this causes the script to find no outliers.

My second idea was to add very small offsets to the values to make them unique, e.g. data = [10.0, 10.00000001, 10.00000002, 10.00000003, 10.00000004, 10.00000004, 10.00000005, 100]. This way the MAD can't become 0, and my script is able to detect the outlier 100.

Does somebody have better ideas?

Am I doing something wrong?

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
szuuuken
  • 173
  • 1
  • 5
  • Could you expand a little on the specifics of the data. E.g. is the data assumed truncated at zero or ten? Are all non-deviant observations assumed to be concentrated at one value? Are outliers only in the right tail? – Jim Apr 11 '18 at 21:50
  • 3
    If I was writing code to do this I would just emit a message to the effect that MAD is zero, so all values not equal to the median may be flagged as outliers. Everything except $10$, even $10 \pm 10^{-k}$, would necessarily be flagged. Anything else looks arbitrary. That's not to dispute that some people really do want automated scanning of many variables; they should be prepared for the need for extra judgments. – Nick Cox Apr 12 '18 at 12:20

2 Answers

7

Three facts will help you here.

  • What you discovered is called the exact fit property. If a proportion $\alpha > 0.5$ of the observations in your sample have the same value, the MAD of your sample will be 0.
  • This is not a property of the MAD in particular, but of all robust estimators of scale. More precisely: any robust estimator of scale with a breakdown point of $0< \alpha < 0.5$ will have an exact fit property at the level of $1-\alpha$ (see section 3 of Croux et al., 2006, [0], for example).
  • Your two proposals amount to replacing the value of $M_i$ by arbitrary numbers in case of exact fit (setting $M_i=0$ in the former and $M_i=O(1/\sigma)$, where $\sigma$ is the amount by which you perturb the data, in the latter).

The solutions you propose (see the third point above) are not the correct ones.

In fact, the correct solution to your problem is much simpler. Keep the MAD, keep the outlier rejection rule. All you need to do is adopt the convention $0/0:=0$ when computing the outlier detection rule. This convention has no impact outside of exact fit cases, so you can use the rule regardless of whether the MAD is strictly positive or not.

This is because:

In an exact fit situation, where more than half of the data is tied at some arbitrary value $x$, all observations in your sample that differ from $x$ are severe outliers.

In such a situation, all observations in your sample that are different from $x$ are, after all, infinitely divergent from the pattern of the bulk of the data. Adopting the $0/0:=0$ convention will then assign the correct outlyingness score both to the observations equal to $x$ ($M_i=0$) and to those different from $x$ ($M_i=\infty$).

The reason you can use this convention is that the exact fit property goes both ways:

MAD = 0 $\iff$ more than half of your sample is tied at the same value.
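
In code, the convention is nothing more than a special branch for the MAD $=0$ case (a minimal sketch; the function names are mine, not part of any library):

    import numpy as np

    def modified_z_scores(x, c=0.6745):
        x = np.asarray(x, dtype=float)
        med = np.median(x)
        abs_dev = np.abs(x - med)
        mad = np.median(abs_dev)
        if mad == 0:
            # Exact fit: points at the median get M_i = 0 (the 0/0 := 0 convention),
            # every other point is infinitely outlying.
            return np.where(abs_dev == 0, 0.0, np.inf)
        return c * (x - med) / mad

    def outliers(x, threshold=3.5):
        return [v for v, m in zip(x, np.abs(modified_z_scores(x))) if m > threshold]

    print(outliers([10, 10, 10, 10, 10, 10, 10, 100]))  # -> [100]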

  • [0] Croux, C., Filzmoser, P., and Oliveira, M. R. (2006). Algorithms for projection-pursuit robust principal component analysis.
user603
  • 21,225
  • 3
  • 71
  • 135
  • 3
    I upvoted this and I prefer this answer. This is a clean proposal, but (correct me if I am wrong) it doesn't make the procedure much more helpful when the MAD is zero. When that is so, values both minutely and enormously different from the median are all alike reported as outliers. I am not saying that there is or should be a Holy Grail that infallibly detects outliers, but the solution is mathematically tidy rather than practically useful. – Nick Cox Apr 12 '18 at 18:34
  • 2
    I like this answer because it clarifies the general issues in the exact fit situation. There is a danger in trying to devise a one-size-fits-all system for removing "outliers"; much depends on what is trying to be accomplished by the outlier removal. – EdM Apr 12 '18 at 18:42
  • @NickCox: if half the data is tied to the median, the majority of the data also has no scatter. Anything that deviates from the median (even minutely) is then arbitrarily far away from the pattern of the majority. Much like a black cat with a single white spot on the nose completely stands out in a clowder of fully black ones. One could argue that in such a situation your data is essentially discrete and the notion of distance much less reliable. The idea is to warn you to such occurrences. – user603 Apr 12 '18 at 19:05
  • 1
    I understand that. It's the problem, not a solution. But the analogy is more that a black cat with one small white spot is treated exactly like a completely white cat. – Nick Cox Apr 12 '18 at 19:09
  • A special rule for the $0/0$ case is not enough. A rule is needed for the $x/0$ case ($x \neq 0$) as well. – Jim Apr 14 '18 at 14:08
  • @Jim. No: whenever the numerator of $M_i$ is non-zero and the denominator is 0, $i$ is an outlier (consider carefully the implication of the $\iff$ and the two paragraphs above it at the end of my answer) and the value of $M_i$ will be larger than 3.5. – user603 Apr 14 '18 at 16:02
2

1. A practical suggestion.

Change this part of the code

    if mad == 0:
        mad = 9223372036854775807 # maxint

to

    if mad == 0:
        mad = 2.2250738585072014e-308 # sys.float_info.min

This does the trick: dividing by this tiny number blows up the Iglewicz-Hoaglin test statistic exactly as desired, so strongly deviant observations are flagged as outliers.
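
A quick sanity check with the data from the question (using the same $0.6745$ constant):

    import sys

    mad = sys.float_info.min               # 2.2250738585072014e-308
    m_outlier = 0.6745 * (100 - 10) / mad  # overflows to inf, far above 3.5 -> flagged
    m_regular = 0.6745 * (10 - 10) / mad   # exactly 0.0 -> not flagged
    print(m_outlier > 3.5, m_regular > 3.5)  # True False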


2. Previous practical suggestion.

What you could do is check whether it works with the closely related mean absolute error (MAE):

$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |x_i - \text{median}(x)|, $$

with $e_i = x_i - \text{median}(x)$ the errors (better: residuals, or deviations).

IBM uses this variant:

$$ M_{i} = \frac{x_{i} - \text{median}(x)} { 1.253314 \cdot \text{MAE} } $$

for the MAD == 0 case.
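
A sketch of how that fallback could be wired into the script (the structure and names are mine, not IBM's actual code):

    import numpy as np

    def iglewicz_hoaglin_with_mae_fallback(data, threshold=3.5):
        x = np.asarray(data, dtype=float)
        dev = x - np.median(x)
        mad = np.median(np.abs(dev))
        if mad != 0:
            scores = 0.6745 * dev / mad      # usual modified z-scores
        else:
            mae = np.mean(np.abs(dev))       # mean absolute deviation about the median
            scores = dev / (1.253314 * mae)  # the MAD == 0 variant above
        return [v for v, m in zip(data, np.abs(scores)) if m > threshold]

    print(iglewicz_hoaglin_with_mae_fallback([10, 10, 10, 10, 10, 10, 10, 100]))  # -> [100]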


3. What is going on here? (From a programming perspective)

Consider the two cases:

  1. $0/0$,
  2. $x/0$ for $x \neq 0$.

Scientific programming languages R, Matlab and Julia have the following behavior:

  1. 0/0 returns NaN.
  2. 90/0 returns Inf.

Python, on the other hand, throws a ZeroDivisionError in both cases.

Practical suggestion one circumvents both cases for both flavors of zero-division handling.
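
To see the difference concretely (plain Python floats raise, while NumPy follows the IEEE 754 behavior of R, Matlab and Julia):

    import numpy as np

    # Plain Python raises for any division by zero:
    try:
        1.0 / 0.0
    except ZeroDivisionError as err:
        print(err)                     # float division by zero

    # NumPy follows IEEE 754 instead (warnings silenced here):
    with np.errstate(divide="ignore", invalid="ignore"):
        print(np.array(0.0) / 0.0)     # nan
        print(np.array(90.0) / 0.0)    # inf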

Jim
  • 1,912
  • 2
  • 15
  • 20
  • 1
    Bad idea. The estimator you propose is no more robust to outliers than the SD. So it is useless for the purpose of interest here. – user603 Apr 11 '18 at 19:23
  • @user603 no, it is more robust b/c it doesn’t square the deviations. Wrt to the outer mean: it could be replaced with a trimmed mean which *is* of course even more robust. – Jim Apr 11 '18 at 19:41
  • The breakdown point of the measure you propose (MAE) is the same as that of the SD, namely $1/n$. Hence it is not more robust, contrary to what you claim. And the squares have nothing to do with it (to see this: try increasing the robustness of the MAE by taking the nth root of the summands). Replacing the outer mean by another estimator (the trimmed mean) would solve the robustness problem. But in so doing the new estimator (with the trimmed mean) will again be subject to being 0 whenever the share of ties is greater than one minus the trimming proportion. See the citation in my answer. – user603 Apr 11 '18 at 19:44
  • @user603 Surely you are not claiming that *in general* or for the outer *mean* $(\cdot)^2$ is just as robust as $|\cdot|$. Why would we have [quantile regression](https://en.wikipedia.org/wiki/Quantile_regression) if OLS would be *outlier robust*? I believe you are interpreting the word "robust" a little too narrowly (or should I say *non-robustly*). – Jim Apr 11 '18 at 21:26
  • You are misreading my comments in the second sentence of your previous comment. It is *incorrect* to read my comments as saying that OLS is robust to outliers (it is not). You are correct to read my comment as implying that quantile regression is *as robust to outliers as* OLS. Indeed: both have a vanishing breakdown point and neither is robust to outliers. Quantile regression is meaningful in many situations in which OLS is not. But samples contaminated by outliers (which is what the OP is interested in) ain't one of them. – user603 Apr 11 '18 at 22:00
  • @user603 I propose we end this discussion. (One that I did not start) – Jim Apr 11 '18 at 22:04
  • @user603 I do not wish to restart the discussion, but I am curious if you could point me to a reference to back up the claim that *"[...] quantile regression is as robust to outliers as OLS."* – Jim Apr 12 '18 at 20:44
  • Sure, just google 'breakdown point quantile regression'. My first hit was [this](https://www.bauer.uh.edu/rsusmel/phd/ec1-25.pdf) document, where the claim is at page 14 (slide 27). Or any textbook on robust methods. I like [this](https://onlinelibrary.wiley.com/doi/book/10.1002/0470010940) one. Your current practical solution is right, the ibm one is wrong (for all the reasons explained in the chain of comments above). Like it or not, it is an *unavoidable* consequence of robust estimation of scale that it will be 0 whenever a large fraction of the sample is tied. – user603 Apr 12 '18 at 23:33
  • @user603 This confirms what I wrote previously. You seem to interpret the word "*robust*" as exclusively referring to "*breakdown point robust*". I believe the world [of statistics] is a little bit bigger than that. Further, both suggestions solve the OP's problem. And all versions of this answer contained the words "*practical suggestion*". Therefore, I kindly ask you to reconsider your downvote. And finally, I politely ask that you consider cleaning up the *graffiti* below this answer yourself. – Jim Apr 14 '18 at 21:16
  • You asked me to back my assertion that quantile regression is not more robust than OLS to outliers and I pointed you to the predominant formal measure of robustness to outliers. But I could have referred to other measures of robustness. QR (like OLS) has unbounded *influence function* (another formal measure of sensitivity to outliers). The asymptotic bias of QR (a third formal measure of sensitivity to outliers) like that of OLS, grows without bounds in the presence of a vanishingly small rate of contamination by outliers. – user603 Apr 15 '18 at 09:26
  • @user603 you made a lot of comments that you have since deleted. I haven't read them – sorry. The first source you cited contains this quote: *"Following Huber (1981), we will interpret robustness as insensitivity to small deviations from the assumptions the model imposes on the data."* I'll leave it at that. – Jim Apr 15 '18 at 18:30
  • Ironically, the notion of robustness used in the paper you now cite for support is the breakdown point. But all three measures of robustness I mentioned (breakdown point, influence function and asymptotic bias) as well as the results I commented on the relative robustness to outliers of QR and OLS are also presented in Huber 1981 (though, the results I cite are also available in the textbook I referenced earlier). – user603 Apr 15 '18 at 18:55