
I am trying to implement outlier detection using the zscore calculation from scipy.stats in Python.

I was thinking that a border around the data of 2 standard deviations should be fine to detect outliers. But it is not that easy.

For example, when I have the following data I get the following z-scores.

import scipy.stats as stats
test = [135.77,135.77,135.77,135.77,135.77,135.77,135.77,135.77,135.77,135.78]

print(stats.zscore(test))

[-0.33333333 -0.33333333 -0.33333333 -0.33333333 -0.33333333 -0.33333333
-0.33333333 -0.33333333 -0.33333333  3.        ]

Please note the z-score of 3 for the last value, which is just 0.01 higher than the others; because all the preceding values are exactly the same, that is the result zscore gives.

On the other hand, the following values contain extreme outliers, yet their z-scores stay below 3 in absolute value.

test = [135.0, 135.86, 135.5, 134.96, 135.5, 135.68, 135.41, 134.96, 135.68, 135.68, 
0.0, 135.77, 135.05, 135.32, 135.68, 135.77, 135.05, 135.86, 0.0, 0.0]

print(stats.zscore(test))

[ 0.41067506  0.42845544  0.42101249  0.40984807  0.42101249  0.42473397
0.41915175  0.40984807  0.42473397  0.42473397 -2.3804309   0.4265947
0.4117088   0.41729102  0.42473397  0.4265947   0.4117088   0.42845544
-2.3804309  -2.3804309 ]

Any ideas on how to detect outliers using z-scores in a reliable way?

Martin S

1 Answer


Let's look at your data.

>>> from statistics import mean, stdev
>>> test = [135.77,135.77,135.77,135.77,135.77,135.77,135.77,135.77,135.77,135.78]
>>> mean(test), stdev(test)
(135.77100000000002, 0.0031622776601655032)

The mean is approximately 135.77, since nearly all the values are equal to 135.77 and the only one that is not is very close to it. The standard deviation is very small for the same reason. In such a case, the margin of three standard deviations is very thin, so even a value just 0.01 above the rest lands right at its edge. Everything works as expected.
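
You can reproduce the 3 by hand. One detail to keep in mind: scipy.stats.zscore divides by the population standard deviation (ddof=0 by default), which is statistics.pstdev rather than statistics.stdev. Results rounded for readability:

>>> from statistics import pstdev
>>> round(pstdev(test), 6)
0.003
>>> round((135.78 - mean(test)) / pstdev(test), 6)
3.0

So the single 135.78 sits exactly three (tiny) standard deviations above the mean.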

Now let's look at the second example.

>>> test = [135.0, 135.86, 135.5, 134.96, 135.5, 135.68, 135.41, 134.96, 135.68, 135.68,
... 0.0, 135.77, 135.05, 135.32, 135.68, 135.77, 135.05, 135.86, 0.0, 0.0]
>>> mean(test), stdev(test)
(115.13650000000001, 49.62444254227526)

You have a lot of values close to 135 and a few zeros. Since the zeros are distant from 135, they have a strong impact on the mean. As you can learn from the "If mean is so sensitive, why use it in the first place?" thread, this is a feature, not a bug. The zeros drag the mean down to 115.14 and, just as importantly, inflate the standard deviation to almost 50. After $z$-scaling, most of the values are again close to zero, as expected. The zeros are pretty far from the mean, hence they get rather large (in absolute value) $z$-scores, but because the standard deviation is inflated, that distance is now measured in very large units.
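
Continuing the session, the $z$-score of the zeros checks out against the printed output:

>>> round((0.0 - mean(test)) / pstdev(test), 6)
-2.380431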

Why are the $z$-scores in the second example not higher than three? Answering with a question: why would they be? The three-standard-deviations rule comes from the normal distribution, where about 99.7% of the data lies within three standard deviations of the mean. In your case, first of all, can you assume that the data is normally distributed? Second, 3/20, or 15%, of the samples are zeros. Even if you assumed a normal distribution, a mean and standard deviation estimated from the data would not allow 15% of the samples to lie outside three standard deviations from the mean, so you are making an assumption that contradicts what you are doing.
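
You can see that this is a hard cap, not a coincidence of your particular numbers. In a small sketch with 17 identical inliers and three identical outliers (15% contamination), the outliers' $z$-score is $-\sqrt{0.85/0.15} \approx -2.38$ no matter how extreme they are, because they inflate the standard deviation in proportion to their own distance from the rest:

>>> import scipy.stats as stats
>>> for outlier in (0.0, -1000.0, -1e9):
...     print(round(float(stats.zscore([135.0] * 17 + [outlier] * 3)[-1]), 6))
...
-2.380476
-2.380476
-2.380476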

If you want to detect outliers, then using $z$-scores is not the best idea, at least not if you cannot assume that the data is approximately normally distributed. Even if you could, you would need robust estimators of the mean and standard deviation that are not influenced by the outliers themselves when calculating the parameters.
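
One common robust alternative is the modified $z$-score, which centers with the median and scales with the median absolute deviation (MAD) instead of the mean and standard deviation. A minimal sketch, assuming SciPy 1.5+ for median_abs_deviation; the 3.5 cutoff is a popular convention, not a law:

import numpy as np
from scipy.stats import median_abs_deviation  # requires SciPy >= 1.5

def modified_zscore(x):
    # Center with the median and scale with the MAD. scale="normal"
    # rescales the MAD to estimate the standard deviation under
    # normality, so the usual cutoffs stay comparable.
    x = np.asarray(x, dtype=float)
    return (x - np.median(x)) / median_abs_deviation(x, scale="normal")

test = [135.0, 135.86, 135.5, 134.96, 135.5, 135.68, 135.41, 134.96,
        135.68, 135.68, 0.0, 135.77, 135.05, 135.32, 135.68, 135.77,
        135.05, 135.86, 0.0, 0.0]
print(np.abs(modified_zscore(test)) > 3.5)  # flags only the three zeros

On your second data set the zeros get modified $z$-scores of roughly -250, while all the genuine readings stay below 1 in absolute value, so the cutoff is easy to choose.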

Tim