9

I have small datasets of 40-50 points. Without assuming that the data are normally distributed, I wanted to identify the outliers with at least 90% confidence. I thought a boxplot could be a good way to do that, but I am not sure.

Any help appreciated.

Also, among boxplot implementations I could not find one which, besides drawing the plot, explicitly spits out the outliers.

mpiktas
Abhi
  • 90% confidence of what? – Henry Apr 19 '12 at 23:54
  • What I also see sometimes is that researchers drop the top and bottom X % of their observations to reduce the influence of extreme cases. But I'm unsure whether I agree with it; it's quite arbitrary, isn't it? – C. Pieters May 10 '12 at 19:54
  • You don't have to assume that your data are normally distributed, but since you know what data you're dealing with, you may be able to use another parametric distribution. For example, waiting times are often Poisson-distributed. Then it makes sense to ask whether one data point is likely to have been generated by a given Poisson distribution. – Jack Tanner May 10 '12 at 21:24

4 Answers

23

That's because such an algorithm can't exist. You need an assumed distribution to be able to classify something as lying outside the range of expected values.

Even if you do assume a normal distribution, declaring data points to be outliers is a fraught business. In general, you need not only a good estimate of the true distribution, which is often unavailable, but also a good, theoretically supported reason for your decision (e.g., the subject broke the experimental setup somehow). Such a judgement is usually impossible to codify in an algorithm.

naught101
  • +1. Also, the use of "with 90% confidence" reveals a misunderstanding of the way the concept of confidence could apply in this case. Without a basis for a degree of confidence, there's no systematic way to quantify the level of confidence one might have. It would come down to an arbitrary thing, as if one were to say "I'm x% confident that this soup is too salty." – rolando2 Apr 20 '12 at 00:40
  • @rolando2, that is as it may be, but nonetheless, I'm 90% confident that's a good comment. – gung - Reinstate Monica May 10 '12 at 15:13
6

This does not directly answer your question, but you may learn something from looking at the outliers dataset in the TeachingDemos package for R and working through the examples on the help page. This may give you a better understanding of some of the issues with automatic outlier detection.
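For instance, a minimal sketch (assuming the package is installed from CRAN and that the help topic is named outliers, as described above):

# install.packages("TeachingDemos")   # if not already installed
library(TeachingDemos)
?outliers            # read the discussion on the help page
example(outliers)    # run the examples referred to above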

Greg Snow
2

R will spit out the outliers as in

dat <- c(6,8.5,-12,1,rnorm(40),-1,10,0)
boxplot(dat)$out

which will draw the boxplot and (depending on the random draw from rnorm) give something like

[1]   6.0   8.5 -12.0  10.0
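If you only want the flagged points without drawing the plot at all, boxplot.stats() (the function boxplot() uses internally to compute these statistics) should return the same values:

boxplot.stats(dat)$out   # outliers only; no plot is drawn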
Henry
  • I'm not too sure, and [the docs](http://astrostatistics.psu.edu/su07/R/html/grDevices/html/boxplot.stats.html) aren't that clear, but `$out` is just all the points which lie more than $(\text{coef} + 0.5) \times (\text{box length})$ away from the centre, or something, no? Doesn't that more or less assume a distribution? If not normal, then at least something symmetrical... – naught101 Apr 20 '12 at 00:45
  • No - the default definition of "outlier" for a boxplot is anything more than $1.5 \, IQR$ below the lower hinge or quartile, or more than $1.5 \, IQR$ above the upper hinge or quartile, where $IQR$ is the interquartile range. Since asymmetry will usually affect the relative position of the quartiles and median, you cannot say this assumes a symmetrical distribution. For something like an exponential distribution you will typically only see outliers at the high end, but this is what you would expect anyway. – Henry Apr 20 '12 at 06:37
  • That's more or less what I was trying to say (`coef` is the 1.5 in your comment - see the docs). Outside the IQR though, it's symmetric, and is basically still assuming some kind of distribution. – naught101 Apr 21 '12 at 00:57
  • It's worth noting that finding points more than $1.5 \, IQR$ beyond the quartiles is something that should be expected to happen fairly often, and doesn't necessarily indicate any problems. – gung - Reinstate Monica May 10 '12 at 15:27
  • @gung: This is $1.5 \, IQR$ beyond the quartile, so about $2 \, IQR$ from the median for a symmetric distribution. It also depends on what you mean by "fairly often" and the distribution: almost never for a sample from a uniform distribution; about 0.7% of a sample from a normal distribution; about 5% for a sample from an exponential distribution; about 16% for a sample from a Cauchy distribution. – Henry May 10 '12 at 20:24
  • @Henry: you have a 0.7% chance of a *single* data point lying outside that range, but you have a $1-(0.993)^n$ chance on an $n$-sized dataset (i.e. about a 1/5 chance with sample size 30). – naught101 May 11 '12 at 01:02
  • @naught101: I don't disagree - my numbers were the expected proportions outside the range for large sample sizes: $0.007 \times 30 \approx 0.2$ – Henry May 11 '12 at 01:15
  • I remember having seen a brief paper on this a while ago; of course I can't find it now, but here's my thinking: I start w/ `2*(1-pnorm(4*qnorm(.75)))`, which returns `[1] 0.006976603`, the value you report above, but then I simulate as follows: `set.seed(1); out = c(); for(i in 1:100) {x = rnorm(50); y = boxplot(x, plot=F); out[i] = length(y$out)>=1}; sum(out)/100`, which returns `[1] 0.3`. I.e., 30% of samples w/ $n=50$ will show as having outliers by this method, even though there actually aren't any. – gung - Reinstate Monica May 11 '12 at 01:20
  • @gung: `set.seed(1); out = c(); for(i in 1:100) {x = rnorm(500); y = boxplot(x, plot=F); out[i] = length(y$out)}; sum(out)/50000` gives `0.00738` which is closer to what I was describing – Henry May 11 '12 at 22:45
  • You've divided by 50000 to get that number, but note that you only iterated through the loop 100 times (i.e., `for(i in 1:100)`), thus, at most you could have 100 instances w/ 'outliers'. Also, you changed the code to `length(y$out)` from `length(y$out)>=1`. What you want to know is what % of the time the boxplot procedure will show 'outliers' (i.e., 1 or more) when there aren't any. The answer is that it will look like there are about 1 time in every 3. – gung - Reinstate Monica May 11 '12 at 23:30
1

As others have said, you have stated the question poorly in terms of confidence. There are statistical tests for outliers, like Grubbs' test and Dixon's ratio test, that I have referred to in another post. They assume the population distribution is normal, although Dixon's test is robust to the normality assumption in small samples. A boxplot is a nice informal way to spot outliers in your data. Usually the whiskers are set at the 5th and 95th percentiles, and observations plotted beyond the whiskers are usually considered to be possible outliers. However, this does not involve formal statistical testing.
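For instance, a rough sketch of Grubbs' test using the grubbs.test() function from the outliers package on CRAN (assuming the package is installed; like the test itself, it presumes an approximately normal population):

# install.packages("outliers")   # if not already installed
library(outliers)
set.seed(1)
dat <- c(rnorm(45), 8)   # 45 unremarkable points plus one suspicious value, made up for illustration
grubbs.test(dat)         # tests whether the single most extreme value is an outlier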

Michael R. Chernick
  • Setting the whiskers at these fixed percentiles seems strange to me. Do you have a reference for this? (Tukey, who originated the boxplot, did not use this method: he set the whiskers at the extremes if they are sufficiently close to the quartiles, but no further than 1.5 "steps" (approximately equal to 1.5 times the IQR) out from the quartiles.) This is much more robust for outlier detection than using an extreme percentile, which, by definition, would *always* identify 10% of the data as "outliers," which wouldn't be a very useful procedure. – whuber May 10 '12 at 22:28
  • I don't know if I should have said usually. I think a lot of different points have been used for the whiskers. I think the 1st and 99th percentiles have also been used, as well as the min and max. But if you use the min and max you can't find outliers beyond the whiskers. I have no specific reference that comes to mind at the moment. I did not mean that anything outside the whiskers would be an outlier when the 5th and 95th percentiles are used. I just meant that visually you can see them because they will be far above or below the whiskers. – Michael R. Chernick May 10 '12 at 22:32
  • @whuber there is an assumption of normality when you set the limit to 1.5×IQR. So if your distribution isn't normal, that limit isn't really a good choice. https://towardsdatascience.com/why-1-5-in-iqr-method-of-outlier-detection-5d07fdc82097#:~:text=Well%2C%20as%20you%20might%20have,perceived%20as%20outlier(s). – Shervin Rad Aug 27 '21 at 13:02
  • @Shervin Tukey explicitly did not assume Normality, but he did use it as a reference for justifying that choice. Tukey also presumed that a preliminary re-expression of the data would be applied to make the distribution approximately symmetric. The spirit is one of EDA, which emphasizes robustness and making no distributional assumptions. As far as setting the limits in narrower situations goes, we have addressed this question in particular circumstances: see, for instance, https://stats.stackexchange.com/a/13101/919. – whuber Aug 27 '21 at 13:19