I have a data set of repeated observations and I am trying to determine whether any of the observations are outliers. The research I've done has only turned up methods that determine whether one value (the maximum, the minimum, or one questioned value) is an outlier, or whether both the highest and lowest values are outliers. What I would like to be able to show is whether multiple values throughout the data set are outliers, as I suspect, without knowing exactly how many outliers are present. Any help or direction you could give me would be much appreciated.
-
Sure. You refer to maximum and minimum. Can I infer that your dataset is univariate? – user603 Feb 26 '14 at 15:40
-
Yes, it is a series of drug purities reported by multiple analysts. In essence, 50 people each measured the purity of a sample 10 times and reported it. 3 or 4 values visually appear to be outliers, and I was hoping to use a statistical test to show that they are. – Kscicc26 Feb 26 '14 at 15:42
-
A box and whisker plot might be a nice place to start. http://www.r-bloggers.com/summarising-data-using-box-and-whisker-plots/ https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/boxplot.html – Eric Peterson Feb 26 '14 at 16:17
-
...or even the [adjusted boxplots](http://stats.stackexchange.com/questions/13086/is-there-a-boxplot-variant-for-poisson-distributed-data/13429#13429), explained here in the context of another question, but also usable for your problem. – user603 Feb 26 '14 at 16:23
-
Normally, a data point is considered an outlier when it is outside of [Q1 - 1.5*interquartile range, Q3 + 1.5*interquartile range] – TYZ Feb 26 '14 at 17:22
-
The first question you have to ask yourself is: why are you interested in outliers? (It may be that outliers don't matter in the way you think they do.) The second question is: what model do you assume for how the points are generated? (You can then decide how you might decide what are "outliers".) There's no generic solution. – Wayne Feb 26 '14 at 22:29
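The 1.5 × IQR rule mentioned in the comments above can be sketched in a few lines. This is only an illustration, not a recommendation: the function name `iqr_outliers`, the multiplier argument `k`, and the sample purities are made up for the example, and the choice of quartile convention (here Python's "inclusive" method) is an assumption that shifts the fences slightly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # method="inclusive" is one of several quartile conventions; others
    # give slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical purity readings (percent), invented for this example.
purities = [99.1, 99.3, 99.2, 99.4, 99.0, 99.2, 99.3, 99.1, 95.0, 103.5]
print(iqr_outliers(purities))
```

Because the fences are built from quartiles, a handful of extreme values does not move them much, which is why this rule can flag several outliers at once without being told how many to expect.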
3 Answers
You can draw a boxplot of the main outcome by occasion. The Stata code for doing so is:

    graph box "outcome", over("time/occasion") showyvars marker(1, msize(vsmall)) mark(1, mlab( "participantID" ))

Replace the names in quotation marks with your own variable names. You have to ask the statistical software explicitly to label the markers (the outliers).

You could remove one outlier at a time, and repeat the outlier test, as described in the Wikipedia entry for Grubbs' test.
If your data set is very small, you may end up removing all points though.
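A minimal sketch of that remove-and-retest loop, assuming the two-sided Grubbs statistic G = max|x − mean| / s and the usual critical value built from a Student's t quantile. The names `grubbs_critical`, `iterated_grubbs`, `t_ppf` and `max_outliers` are hypothetical. The t quantile function is passed in by the caller rather than implemented here. Whether this procedure is adequate when several outliers are present is disputed in the comments on this answer.

```python
import math
import statistics


def grubbs_critical(n, alpha, t_ppf):
    """Two-sided Grubbs critical value for a sample of size n.

    t_ppf(p, df) is supplied by the caller and must return the p-quantile
    of Student's t distribution with df degrees of freedom.
    """
    t = t_ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))


def iterated_grubbs(values, t_ppf, alpha=0.05, max_outliers=10):
    """Remove the most extreme point while Grubbs' test rejects it.

    Returns (removed, kept). max_outliers caps the number of removals so
    that a small data set cannot be emptied point by point.
    """
    data = list(values)
    removed = []
    while len(data) > 2 and len(removed) < max_outliers:
        mean = statistics.fmean(data)
        sd = statistics.stdev(data)  # sample standard deviation
        if sd == 0:
            break  # all remaining values are identical
        # The suspect is the point farthest from the current mean.
        suspect = max(data, key=lambda v: abs(v - mean))
        g = abs(suspect - mean) / sd
        if g <= grubbs_critical(len(data), alpha, t_ppf):
            break  # most extreme point is not significant: stop
        data.remove(suspect)
        removed.append(suspect)
    return removed, data
```

If SciPy is available, `scipy.stats.t.ppf` has the `(p, df)` signature expected for `t_ppf`, so a call might look like `iterated_grubbs(purities, scipy.stats.t.ppf)`. Note that the mean and standard deviation are recomputed from the remaining data on every pass.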

-
This will not work, for the reason explained [here](http://stats.stackexchange.com/questions/46229/fast-linear-regression-robust-to-outliers/46234#46234) – user603 Feb 26 '14 at 16:53
-
Grubbs' test will work fine with 500 observations. If 3-4 outliers are expected, it'll require 3-4 iterations. There's no mention of regression in the question, btw. – Aksakal Feb 26 '14 at 17:02
-
Univariate location is just a particular case of regression. No, it won't. Feel free to ask this as a question; it's quite easy to debunk. – user603 Feb 26 '14 at 17:17
-
@user603 It's not clear to me that the issues there cover this case. For example, the design space for univariate outliers is a vector of 1's, so the issue of influential outliers that regression has doesn't come in here. While I think there are better choices than Grubbs' test for this ([e.g.](http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm)), for the case where n=500 and there are a small number of outliers like 4, it does look like iterated application of Grubbs' test tends to find all the large outliers. [A clump of small outliers ("just" outlying points) may be an issue.] – Glen_b Mar 31 '14 at 21:58
You need to define what an outlier is. The data points coming from your repeated measures have a mean and standard deviation.
I would consider a point an outlier if it lies more than 3 or 4 standard deviations away from the mean of the distribution.
You could therefore remove all the data points that fulfill this criterion...
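As a rough sketch of this criterion, assuming the sample mean and sample standard deviation are used as estimates (the function name `sd_rule_outliers` and the data are invented for the example):

```python
import statistics


def sd_rule_outliers(values, k=3.0):
    """Return the values more than k sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


# Hypothetical data: 100 well-behaved readings plus one gross error.
readings = [99.0 + 0.1 * (i % 5) for i in range(100)] + [110.0]
print(sd_rule_outliers(readings))

# With only 10 points, no value can be more than (n - 1) / sqrt(n), about
# 2.85, sample standard deviations from the mean, so k=3 flags nothing.
print(sd_rule_outliers([1, 1, 1, 1, 1, 1, 1, 1, 1, 1000]))
```

The second call illustrates a limitation worth keeping in mind: the outliers themselves inflate the mean and the standard deviation that they are judged against, so in small samples, or when several outliers are present, the rule can fail to flag anything.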

-
You might want to read the counterexample at the beginning of [this](http://stats.stackexchange.com/a/56404/603) answer – user603 Mar 31 '14 at 20:04
-
Yes good point. But what do we do if the distribution is not symmetric? Median and means will be extremely different, no? – bonobo Apr 01 '14 at 20:06
-
If the good part of the data comes from an asymmetric distribution, the median will work fine. The MAD won't. That's why we have these [alternatives](https://www.google.be/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CDIQFjAA&url=http%3A%2F%2Fweb.ipac.caltech.edu%2Fstaff%2Ffmasci%2Fhome%2Fstatistics_refs%2FBetterThanMAD.pdf&ei=msw7U9OyN82shQfC5YHADg&usg=AFQjCNHD-ka0w5UY9ONCT_ocT6bR8ebeIw&sig2=OQlIP_oPFuxrdXAjRXvJng&bvm=bv.63934634,d.ZG4) – user603 Apr 02 '14 at 08:39
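For reference, the median/MAD version of the rule discussed in these comments might look like the sketch below. It assumes the common normal-consistency factor 1.4826 for the MAD and a cutoff of 3, both of which are conventions rather than anything stated in this thread. The function name `mad_outliers` and the data are hypothetical. As the last comment notes, the MAD implicitly assumes the good part of the data is roughly symmetric.

```python
import statistics


def mad_outliers(values, cutoff=3.0):
    """Flag values whose distance from the median exceeds cutoff * scaled MAD."""
    values = list(values)
    med = statistics.median(values)
    # MAD: median of the absolute deviations from the median.
    mad = statistics.median([abs(v - med) for v in values])
    # 1.4826 makes the MAD comparable to a standard deviation under normality.
    scale = 1.4826 * mad
    if scale == 0:
        # More than half the values are identical; the rule is undefined here.
        return []
    return [v for v in values if abs(v - med) > cutoff * scale]


# Hypothetical purity readings (percent), invented for this example.
purities = [99.1, 99.3, 99.2, 99.4, 99.0, 99.2, 99.3, 99.1, 95.0, 103.5]
print(mad_outliers(purities))
```

Unlike the mean and standard deviation, the median and MAD barely move when a few extreme values are added, so several outliers can be flagged in one pass.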