I want to build a multivariate model to find outliers in my data. The data are similar to the iris data set, except that there is no Species attribute: I only have the first four numeric attributes.
    > head(iris)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
There seem to be several methods for multivariate outlier detection. The document at this link lists the following:
- Mahalanobis Distance
- Cook’s Distance
- Leverage Point
- DFFITS
All of them seem to require fitting a regression line, and I understand that regression implies a dependent variable. How can I choose a dependent variable when my data consist only of four continuous numeric columns?
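
To make the setup concrete, here is a minimal sketch of the first method in the list, Mahalanobis distance, computed on the four numeric columns alone. It uses base R's `mahalanobis()`, which takes only a data matrix, a center, and a covariance matrix, so as far as I can tell no dependent variable is involved in this particular case. The 97.5% chi-squared cutoff is an assumption on my part (it presumes the data are roughly multivariate normal), not something taken from the linked document.

```r
# Keep only the four numeric columns, mirroring the data described above
x <- iris[, 1:4]

# Squared Mahalanobis distance of each row from the multivariate center
center <- colMeans(x)
covmat <- cov(x)
d2 <- mahalanobis(x, center, covmat)

# Assumed cutoff: 97.5% quantile of a chi-squared distribution with
# degrees of freedom equal to the number of columns
cutoff <- qchisq(0.975, df = ncol(x))

# Row indices whose squared distance exceeds the cutoff
outliers <- which(d2 > cutoff)
x[outliers, ]
```

Cook's distance, leverage, and DFFITS, by contrast, are normally computed from a fitted model (for example via `cooks.distance()`, `hatvalues()`, and `dffits()` on an `lm` object), which is where the question of a dependent variable arises.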