Suppose you have two columns of data, each composed of numbers very close to either 1 or 3, such that every 2D data point lies near either (1, 1) or (3, 3). Now let's say I "inject" one point, (1, 3): this is a 2D outlier, but it is not an outlier in either x or y alone.
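To make the setup concrete, here is a minimal sketch of such data (cluster sizes and noise scale are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two tight clusters: 500 points near (1, 1) and 500 near (3, 3),
# plus the single injected point (1, 3).
n = 500
data = np.vstack([
    rng.normal(1.0, 0.05, size=(n, 2)),
    rng.normal(3.0, 0.05, size=(n, 2)),
    [[1.0, 3.0]],
])

# Marginally, both 1.0 and 3.0 are typical values in each column,
# so neither coordinate of the injected point is a 1D outlier.
x, y = data[:, 0], data[:, 1]
print(x.min() <= 1.0 <= x.max())  # True
print(y.min() <= 3.0 <= y.max())  # True
```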
There are many ways to easily identify this point, e.g.:

- Euclidean distances (the injected point will be far away from all the rest);
- numerical multivariate density estimation (the density around it will be much lower than around any other point);
- rounding each value to the nearest of 1 or 3 and then grouping by both columns (this yields the clusters (1, 1), (3, 3), and (1, 3), the last having only one member).
All these methods will be very cheap computationally.
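For instance, the rounding-and-grouping approach can be sketched as follows (the data generation mirrors the toy example above; the rounding threshold of 2 is my own choice):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy data: two clusters near (1, 1) and (3, 3), plus the injected (1, 3).
n = 500
data = np.vstack([
    rng.normal(1.0, 0.05, size=(n, 2)),
    rng.normal(3.0, 0.05, size=(n, 2)),
    [[1.0, 3.0]],
])

# Round each coordinate to the nearest of {1, 3}, then group by the pair.
rounded = np.where(data < 2, 1, 3).tolist()
counts = Counter(map(tuple, rounded))
# (1, 1) and (3, 3) each get ~500 members; (1, 3) gets exactly one.

# Groups with a single member are outlier candidates.
outliers = [group for group, c in counts.items() if c == 1]
print(outliers)  # [(1, 3)]
```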
Now let's say I want to do this in many dimensions: say 10^6 rows and 1000 columns.
All the above methods fail, either in principle or due to computational cost.
For example, density estimation methods are only practical up to roughly 10 dimensions; Euclidean distance is both expensive and nearly meaningless for so many dimensions; and partitioning the space into groups very quickly produces far too many "cubes", since their number grows exponentially with dimension.
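The loss of distance contrast can be illustrated quickly: for random points, the spread of pairwise distances shrinks relative to their magnitude as dimension grows (the sample size and dimensions below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=200):
    """Relative spread (max - min) / min of distances from one query
    point to n random points in the unit cube of the given dimension."""
    points = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# Contrast collapses as the dimension grows: in 1000D, the nearest and
# farthest points are almost equally far away.
contrast = {dim: distance_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrast.items():
    print(dim, round(c, 3))
```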
Even PCA will most probably get rid of only part of the columns, and even if the dimensionality is reduced significantly, outliers may be missed precisely because of the approximation of throwing those columns out.
With clustering methods, I would have to choose the number of clusters, which I don't know in advance - and they will probably fail for exactly the same reasons anyway.
Any ideas?