
Consider a large (n >> 10 000) data set whose population is clearly not normal, in which outliers are detected/tested for using mean +/- 3 standard deviations.

Several of my colleagues use this approach to detect outliers. I argue that this is wrong, and multiple sources state that normality must hold or the tests are inaccurate. But what exactly makes the above test inaccurate?

Additionally, if someone uses the above test on non-normal data and finds X outliers, can any conclusions be drawn, or is the only conclusion "the data set had X outliers, but the data were not normal and therefore the test was inaccurate"?

kjetil b halvorsen
MLEN
  • Related: [Is there a boxplot variant for Poisson distributed data?](https://stats.stackexchange.com/q/13086/), & [Outlier Detection on skewed Distributions](https://stats.stackexchange.com/q/129274/), especially the answers by user603. – gung - Reinstate Monica Mar 26 '20 at 16:08

1 Answer


Outlier detection may not make sense for your data.

Consider a dataset of size 10,000 known to be taken from $\mathsf{Exp}(\mu = 10),$ as follows.

set.seed(325);  x = rexp(10^4, .1)

Not surprisingly, the sample mean and SD are both near 10.

summary(x)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
  0.00108   2.90603   6.80087  10.08672  13.91946 103.96481 
sd(x)
[1] 10.20549

Thus $\bar X + 3S \approx 40.$

mean(x) + 3*sd(x)
[1] 40.70318

Accordingly, your criterion would "identify" 170 outliers.

sum(x > 40.7)
[1] 170
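
To see why the rule flags so many points, it may help to compare the tail probability beyond mean + 3 SD under the exponential model with the same probability under a normal model. The following is only a sketch, assuming the population used above (exponential with mean 10, hence rate 0.1); the variable names are illustrative and not part of the original session.

```r
# Population values assumed from the example above: Exp with mean 10
mu    <- 10          # population mean
sigma <- 10          # for an exponential distribution, the SD equals the mean
n     <- 10^4        # sample size used in the example

# Upper-tail probability beyond mu + 3*sigma under the exponential model
p_exp <- pexp(mu + 3 * sigma, rate = 1 / mu, lower.tail = FALSE)  # equals exp(-4)

# Upper-tail probability beyond mean + 3 SD under a normal model
p_norm <- pnorm(3, lower.tail = FALSE)

# Expected number of points flagged on the high side among n observations
c(exponential = n * p_exp, normal = n * p_norm)
```

Under these assumptions about 183 of 10,000 exponential observations are expected to exceed the cutoff, which is in line with the 170 observed, against roughly 13 for normal data. The low side contributes nothing here, because mean - 3 SD is negative and exponential values cannot be.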

Nor does a boxplot do a useful job of identifying outliers here.

boxplot(x, horizontal=TRUE, col="skyblue2")

[Figure: horizontal boxplot of x]
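
The boxplot's behaviour can also be anticipated from the population quartiles. The sketch below assumes the usual 1.5 × IQR fences and the same exponential population (rate 0.1); it uses population quartiles rather than the sample hinges that `boxplot` itself computes, so the sample count will differ somewhat.

```r
rate <- 0.1                                  # assumed: Exp with mean 10, as above
q    <- qexp(c(0.25, 0.75), rate = rate)     # population quartiles Q1 and Q3
iqr  <- q[2] - q[1]

lower_fence <- q[1] - 1.5 * iqr              # negative, so no value can fall below it
upper_fence <- q[2] + 1.5 * iqr

# Probability that an exponential observation lands beyond the upper fence
p_out <- pexp(upper_fence, rate = rate, lower.tail = FALSE)
c(upper_fence = upper_fence, p_out = p_out)
```

With these assumptions the upper fence sits near 30.3 and about 4.8% of observations are expected beyond it, so a sample of 10,000 would show several hundred points plotted individually. For the simulated sample, `length(boxplot.stats(x)$out)` gives the corresponding count.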

For many practical purposes, a 'corrected' right-skewed dataset (devoid of its naturally-occurring high values) would give a seriously misleading impression.
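
As a rough illustration of how misleading the pruned data can be, one can compute the mean of an exponential variable restricted to values below the cutoff. This is a sketch under the same assumed population (mean 10, SD 10), using the population cutoff of 40 rather than the sample value of 40.7; the names are hypothetical.

```r
mu     <- 10                 # assumed population mean, as in the example
cutoff <- mu + 3 * mu        # population version of mean + 3 SD (SD = mu)

# Mean of an exponential variable restricted to values below the cutoff:
# integral of x * f(x) over [0, cutoff], divided by P(X < cutoff)
num <- integrate(function(x) x * dexp(x, rate = 1 / mu),
                 lower = 0, upper = cutoff)$value
trimmed_mean <- num / pexp(cutoff, rate = 1 / mu)

c(true_mean = mu, trimmed_mean = trimmed_mean)
```

Under these assumptions the pruned population has a mean of about 9.25, roughly 7.5% below the true mean of 10. The sample analogue is `mean(x[x <= mean(x) + 3*sd(x)])`.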

Systematic pruning of 'outliers' is very frequently a bad idea, whatever criterion is used to 'identify' them.

If you could discuss your purpose in trying to get rid of outliers, perhaps we could be more helpful.

BruceET
  • The purpose is actually not to get rid of any outliers, just to identify them. Some manual investigation may follow later. Would you mind elaborating on the sentence about the "'corrected' right-skewed dataset"? – MLEN Mar 25 '20 at 18:45
  • My point was that my data were generated from the skewed exponential distribution. So the sample naturally has a long tail to the right, and what appear to be outliers are giving correct information. – BruceET Mar 26 '20 at 17:11