
I want to construct a multivariate model to find outliers in my data. The data I have is similar to the iris data, except that I do not have the Species attribute; I only have the first four numeric attributes.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

It seems there are a few methods for multivariate outlier detection, described in the document at this link:

  1. Mahalanobis Distance
  2. Cook’s Distance
  3. Leverage Point
  4. DFFITS

All of them seem to require fitting a regression line, and I understand that regression implies a dependent variable. However, how can I choose a dependent variable, given that my data only has the four numeric continuous columns?

Duy Bui

2 Answers


You could try DBSCAN clustering to detect outliers in your data. The outlier (noise) class is usually assigned -1 as its cluster label.
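For illustration, here is a minimal sketch using the R dbscan package (an assumption about your setup; note that this package labels noise points as cluster 0, while implementations such as scikit-learn use -1). The eps and minPts values are purely illustrative and would need tuning for your data.

library(dbscan)                      # assumed package

x <- as.matrix(iris[, 1:4])          # only the four numeric columns

# eps and minPts are illustrative, not tuned
db <- dbscan(x, eps = 0.5, minPts = 5)

# in the R dbscan package, noise (outlier) points get cluster label 0
outliers <- x[db$cluster == 0, ]
head(outliers)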

kjetil b halvorsen
  • Thank you. I saw a few tutorials on density-based spatial clustering with DBSCAN in R. I think the outlier class is usually assigned 0. What concerns me is how to choose the best epsilon. Also, is there anything to validate the clustering (such as accuracy)? – Duy Bui Mar 20 '17 at 16:37
  • You can use the knee method to choose epsilon. This method uses the mean of the distances between each point and its k nearest neighbors; the point where the curve begins to trend upward (the "knee") is your best epsilon (see the sketch after these comments). Alternatively, you could use OPTICS, which already optimizes epsilon. – jeweinb Mar 22 '17 at 13:09
  • Thanks. I tried the "knee" approach and it doesn't work on my data. I adjusted the value of k (in kNN) from 5 to 1500 but the plot looks much the same (click [here](https://drive.google.com/open?id=0B4fYMW1NjfHJNXd3UVVRTDJpT2M) for the plot), so I don't know how to choose the best epsilon. I also tried OPTICS, but you still need to choose epsilon with that approach, and I don't see how it optimizes it. For example, in my model I chose minPts = 5, eps = 2000 and eps_cl = 1500 (it is advised that eps_cl <= eps). – Duy Bui Mar 22 '17 at 15:20
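As an illustration of the knee method discussed in the comments, the dbscan package provides a k-nearest-neighbour distance plot; this is only a sketch, and the k and eps values are illustrative.

library(dbscan)                      # assumed package

x <- as.matrix(iris[, 1:4])
kNNdistplot(x, k = 4)                # look for the "knee" where the curve bends sharply upward
abline(h = 0.5, lty = 2)             # illustrative eps candidate read off the plot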

Your methods 2, 3, and 4 only make sense with a response/dependent variable. But 1 (Mahalanobis distance) can be computed using the mean vector and covariance matrix of the data columns, without needing a regression model. Since you are trying to find outliers, you may want to use a "robust" estimate of the covariance matrix, or compute the mean and covariance matrix leaving each point out in turn and then calculate the distance for the left-out point (sometimes called studentizing).
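As a rough sketch (using the iris columns as a stand-in for your data, and an assumed chi-squared cutoff), the distances can be computed with base R's mahalanobis(); a robust variant using MASS::cov.rob is also shown.

x <- as.matrix(iris[, 1:4])

# squared Mahalanobis distance of each row from the sample mean
md <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# flag points beyond the 97.5% chi-squared quantile with df = number of columns
# (an assumed threshold, adjust to taste)
cutoff <- qchisq(0.975, df = ncol(x))
which(md > cutoff)

# robust variant, assuming the MASS package is available
rob <- MASS::cov.rob(x)
md_rob <- mahalanobis(x, center = rob$center, cov = rob$cov)
which(md_rob > cutoff)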

You can also use principal components analysis as another approach for looking for unusual points (outliers) in multiple dimensions.
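For example, one simple (assumed) rule is to flag observations with extreme scores on any principal component; the 3-standard-deviation threshold below is purely illustrative.

x <- as.matrix(iris[, 1:4])

pc <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pc$x                        # component scores for each observation

# flag observations more than 3 SDs out on any component (illustrative cutoff)
flag <- apply(abs(scale(scores)), 1, max) > 3
which(flag)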

Greg Snow