What are the popular methods for outlier detection in univariate data, which do not assume normal distribution?

user27241
  • Many, many relevant questions on outliers here: have you consulted several? You need to have a specific new question to have much chance of a detailed answer. In any case, I wouldn't put much faith in popularity of methods any more than I judge newspapers by their circulation. Dropping presumed outliers just because they appear inconvenient is, I guess, one of the most popular methods, but arguably the worst of all. – Nick Cox Jan 30 '15 at 15:01
  • One possibility is http://stats.stackexchange.com/questions/78063/replacing-outliers-with-mean/78067#78067 In that thread, as may happen here, the answers were wider than the question. – Nick Cox Jan 30 '15 at 15:16

1 Answer

Generally, you should avoid trimming outliers in an ad hoc fashion and instead use nonparametric or robust alternatives. A recent review with Monte Carlo studies can be found in Bakker and Wicherts (2014). According to that review, Z-score cut-offs were the most popular approach, at least in psychology journals. I wouldn't recommend them, though: the simulation studies in the same article demonstrate that Z-score cut-offs can inflate Type I error rates.

Although the review focuses on independent-samples t-tests, most of its recommendations apply more broadly. The authors concluded with the following recommendations:

• Correct or delete erroneous values.

• Based on prior research, it is not recommended to use Z scores to identify outliers. We recommend methods that suffer less from masking like the IQR or the MAD-median rule instead.

• Decide on outlier handling before seeing the results of the main analyses, and if possible, preregister the study at, for example, the Open Science Framework (http://openscienceframework.org/).

• If preregistration is not possible, report the outcomes both with and without outliers or on the basis of alternative methods.

• Report transparently about how outliers were handled.

• Do not carelessly remove outliers as this increases the probability of finding a false positive, especially when using a threshold value of Z lower than 3 or when the data are skewed.

• Use methods that are less influenced by outliers like nonparametric or robust methods such as the Mann-Whitney-Wilcoxon test and the Yuen-Welch test, or researchers may choose to conduct bootstrapping (all without removing outliers).
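
As a rough illustration of the IQR and MAD-median rules named in the recommendations, here is a minimal sketch in plain Python. The function names, the 1.5 IQR multiplier, the 0.6745 scaling constant, the 2.24 cut-off, and the sample durations are assumptions chosen for illustration (they are common conventions, not values taken from the answer above).

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k=1.5 is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


def mad_median_outliers(values, cutoff=2.24):
    """Flag values whose distance from the median, scaled by MAD/0.6745,
    exceeds the cutoff (0.6745 and 2.24 are assumed conventional constants)."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    madn = mad / 0.6745  # rescaled so it estimates the SD under normality
    if madn == 0:
        return []  # more than half the values are identical; rule is undefined
    return [x for x in values if abs(x - med) / madn > cutoff]


def z_score_outliers(values, cutoff=3.0):
    """Classical Z-score rule, shown only for comparison."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / sd > cutoff]


# Hypothetical right-skewed durations in seconds (made-up data).
durations = [12, 14, 15, 15, 16, 17, 18, 19, 95, 110]

print("IQR rule:       ", iqr_outliers(durations))
print("MAD-median rule:", mad_median_outliers(durations))
print("Z-score rule:   ", z_score_outliers(durations))
```

On this made-up sample both robust rules flag 95 and 110, while the Z-score rule flags nothing, because the two large values inflate the mean and standard deviation that the rule depends on (masking). Flagging a value is only a prompt to inspect it; it does not by itself justify removing it.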

References:

Bakker, M., & Wicherts, J. M. (2014). Outlier removal, sum scores, and the inflation of the type I error rate in independent samples t tests: The power of alternatives and recommendations. Psychological Methods, 19(3), 409-427.

Anthony
  • This is good advice, although I would greatly widen the list of approaches worth considering. I note that it does not answer the question as it says precisely nothing about popularity. As I consider that a dubious criterion, I am happy to upvote. – Nick Cox Jan 30 '15 at 15:14
  • @Nick Cox, good points. I've edited the answer. It now makes clear that the cited article focused mainly on outliers in independent-samples t-tests (that's why they emphasize the Mann-Whitney-Wilcoxon and Yuen-Welch tests). – Anthony Jan 30 '15 at 15:18
  • @NickCox thanks for the link, it is quite useful. I have records of the behavior of n individuals, expressed in seconds. I want to apply a clustering algorithm to identify behavioral groups. However, I found that some individuals did other activities during the recorded time, so the distribution is right-skewed. I believe this can affect my conclusions and the whole analysis, so I'm trying to find a way to reduce the influence of such records. – user27241 Jan 30 '15 at 15:47
  • You have a more specific question then. So either edit this question or (better, I think) start a new thread. However, I would advise expanding this outline greatly. – Nick Cox Jan 30 '15 at 15:51