
I want to construct a multivariate model to find outliers in my data. The data I have is similar to the iris data, except that I do not have the Species attribute; I only have the first four numeric attributes.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

It seems there are a few methods for multivariate outlier detection, described in the document at this link:

  1. Mahalanobis Distance
  2. Cook’s Distance
  3. Leverage Point
  4. DFFITS

All of them seem to require fitting a regression line, and I understand that regression implies a dependent variable. However, how can I choose a dependent variable, given that my data only has the four numeric continuous columns?

Duy Bui

2 Answers


You could try DBSCAN clustering to detect outliers in your data. The outlier (noise) class is usually assigned -1 as its cluster label.
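For illustration, here is a minimal sketch using the R dbscan package (an assumption about your setup; note that this package labels noise points as cluster 0, while implementations such as scikit-learn use -1). The eps and minPts values are purely illustrative and would need tuning for your data.

library(dbscan)                      # assumed package

x <- as.matrix(iris[, 1:4])          # only the four numeric columns

# eps and minPts are illustrative, not tuned
db <- dbscan(x, eps = 0.5, minPts = 5)

# in the R dbscan package, noise (outlier) points get cluster label 0
outliers <- x[db$cluster == 0, ]
head(outliers)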

kjetil b halvorsen
  • Thank you. I saw a few tutorials on density-based spatial clustering with DBSCAN in R. I think the outlier class is usually assigned 0. What concerns me is how to choose the best epsilon. Also, is there anything to validate the clustering (such as accuracy)? – Duy Bui Mar 20 '17 at 16:37
  • You can use the knee method to choose epsilon. This method uses the mean of the distances between each point and its k nearest neighbors; the point where the curve begins to trend upward (the "knee") is your best epsilon (see the sketch after these comments). Alternatively, you could use OPTICS, which already optimizes epsilon. – jeweinb Mar 22 '17 at 13:09
  • Thanks. I tried the "knee" approach and it doesn't work on my data. I adjusted the value of k (in kNN) from 5 to 1500 but the plot looks much the same (click [here](https://drive.google.com/open?id=0B4fYMW1NjfHJNXd3UVVRTDJpT2M) for the plot), so I don't know how to choose the best epsilon. I also tried OPTICS, but you still need to choose epsilon with that approach, and I don't see how it optimizes it. For example, in my model I chose minPts = 5, eps = 2000 and eps_cl = 1500 (it is advised that eps_cl <= eps). – Duy Bui Mar 22 '17 at 15:20
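As an illustration of the knee method discussed in the comments, the dbscan package provides a k-nearest-neighbour distance plot; this is only a sketch, and the k and eps values are illustrative.

library(dbscan)                      # assumed package

x <- as.matrix(iris[, 1:4])
kNNdistplot(x, k = 4)                # look for the "knee" where the curve bends sharply upward
abline(h = 0.5, lty = 2)             # illustrative eps candidate read off the plot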

Your methods 2, 3, and 4 only make sense with a response/dependent variable. But 1 (Mahalanobis distance) can be computed using the mean vector and covariance matrix of the data columns, without needing a regression model. Since you are trying to find outliers, you may want to use a "robust" estimate of the covariance matrix, or compute the mean and covariance matrix leaving each point out in turn and then calculate the distance for the left-out point (sometimes called studentizing).
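As a rough sketch (using the iris columns as a stand-in for your data, and an assumed chi-squared cutoff), the distances can be computed with base R's mahalanobis(); a robust variant using MASS::cov.rob is also shown.

x <- as.matrix(iris[, 1:4])

# squared Mahalanobis distance of each row from the sample mean
md <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# flag points beyond the 97.5% chi-squared quantile with df = number of columns
# (an assumed threshold, adjust to taste)
cutoff <- qchisq(0.975, df = ncol(x))
which(md > cutoff)

# robust variant, assuming the MASS package is available
rob <- MASS::cov.rob(x)
md_rob <- mahalanobis(x, center = rob$center, cov = rob$cov)
which(md_rob > cutoff)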

You can also use principal components analysis as another approach for looking for unusual points (outliers) in multiple dimensions.
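For example, one simple (assumed) rule is to flag observations with extreme scores on any principal component; the 3-standard-deviation threshold below is purely illustrative.

x <- as.matrix(iris[, 1:4])

pc <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pc$x                        # component scores for each observation

# flag observations more than 3 SDs out on any component (illustrative cutoff)
flag <- apply(abs(scale(scores)), 1, max) > 3
which(flag)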

Greg Snow