I have a matrix where the rows are the data points (samples) and the columns are the features (predictors). Let's say I have 1000 data points and 20 features, i.e. the matrix is of size 1000 x 20.

Now I want to detect and possibly remove outliers. I have read a good introduction: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm

One possibility is, for example, to use the modified Z-score and remove everything with an absolute value above 3.5.

First, how should I apply this? Should I calculate the modified Z-score for each row (data point) of the matrix and remove the rows that are flagged as outliers, or should I calculate it for each column (feature)? I have the same problem when making plots (e.g. histograms).
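
To make the per-column option concrete, here is a minimal sketch in Python with NumPy (the language, the function name `modified_z_scores`, and the synthetic data are my own assumptions, not something from the handbook). It uses the usual definition of the modified Z-score, 0.6745 * (x - median) / MAD, computed separately for each feature, and then drops every row that has at least one flagged entry. Whether this per-feature treatment is appropriate at all is exactly what I am unsure about.

```python
import numpy as np

def modified_z_scores(X):
    """Column-wise modified Z-scores: 0.6745 * (x - median) / MAD."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)                # one median per feature
    mad = np.median(np.abs(X - median), axis=0)  # median absolute deviation per feature
    # A zero MAD would divide by zero; turn it into NaN so nothing in that column is flagged.
    mad = np.where(mad == 0, np.nan, mad)
    return 0.6745 * (X - median) / mad

# Synthetic stand-in for the 1000 x 20 matrix (assumption: no real data available here).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
X[0, 3] = 25.0  # planted extreme value in feature 3 of sample 0

M = modified_z_scores(X)
flagged = np.abs(M) > 3.5             # per-entry flags, threshold from the handbook
rows_to_drop = flagged.any(axis=1)    # a row is dropped if any feature is flagged
X_clean = X[~rows_to_drop]
```

Note that with 20 features, dropping a row whenever any single feature is flagged removes more rows than the per-feature threshold alone suggests.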

Second, which outlier detection method is best (possibly also for data that are not normally distributed)? There are so many. Simple methods like the modified Z-score or just looking at the standard deviation seem to be used often.

machinery
  • You are confusing methods for univariate outlier detection with methods for multivariate outlier detection. An observation can be a multivariate outlier without being outlying in any of the individual variables. – user603 Jan 12 '16 at 18:17
  • @user603 ok, which method for univariate and multivariate outlier detection would you recommend? I think modified Z-score is a univariate outlier detection method. Could you briefly explain how I should apply it? – machinery Jan 12 '16 at 20:47
  • Have you checked the top answer to [this](http://stats.stackexchange.com/questions/213/what-is-the-best-way-to-identify-outliers-in-multivariate-data) question? – user603 Jan 12 '16 at 23:27
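
The point raised in the comments, that an observation can be a multivariate outlier without standing out in any single variable, can be illustrated with a small sketch. This is only an illustration under stated assumptions: the two-feature synthetic data, the planted point, and the use of Mahalanobis distance with the classical sample mean and covariance are my choices, not a method recommended in the thread. Because outliers distort the classical estimates, a robust estimator of location and scatter would usually be preferable in practice.

```python
import numpy as np

# Two strongly correlated features (synthetic, for illustration only).
rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.95], [0.95, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1000)
X[0] = [1.5, -1.5]  # ordinary in each coordinate, but it contradicts the correlation

# Squared Mahalanobis distance using the classical (non-robust) mean and covariance.
mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

# 0.999 quantile of the chi-square distribution; this closed form holds for 2 degrees of freedom only.
cutoff = -2.0 * np.log(1.0 - 0.999)
multivariate_flag = d2 > cutoff

# Per-feature check with ordinary Z-scores for comparison.
z = np.abs(diff / X.std(axis=0))
univariate_flag = (z > 3.5).any(axis=1)
```

Here the planted point is flagged by the multivariate check while the per-feature check leaves it alone.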
