Suppose you have two columns of data, each composed of numbers very close to either 1 or 3, such that every 2D data point lies near either (1, 1) or (3, 3). Now let's say I "inject" one point, (1, 3): this is a 2D outlier, but it is not an outlier in either x or y alone.
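To make the setup concrete, here is a minimal sketch of such data (cluster sizes and noise scale are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two tight clusters: 500 points near (1, 1) and 500 near (3, 3),
# plus the single injected point (1, 3).
n = 500
data = np.vstack([
    rng.normal(1.0, 0.05, size=(n, 2)),
    rng.normal(3.0, 0.05, size=(n, 2)),
    [[1.0, 3.0]],
])

# Marginally, both 1.0 and 3.0 are typical values in each column,
# so neither coordinate of the injected point is a 1D outlier.
x, y = data[:, 0], data[:, 1]
print(x.min() <= 1.0 <= x.max())  # True
print(y.min() <= 3.0 <= y.max())  # True
```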
There are many ways to easily identify this point, e.g.:

- Euclidean distances (the injected point will be far away from all the rest);
- numerical multivariate density estimation (the density around it will be much lower than around any other point);
- rounding each value to the nearest of 1 or 3 and then grouping by both columns (this yields the clusters (1, 1), (3, 3), and (1, 3), the last having only one member).
All these methods will be very cheap computationally.
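For instance, the rounding-and-grouping approach can be sketched as follows (the data generation mirrors the toy example above; the rounding threshold of 2 is my own choice):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy data: two clusters near (1, 1) and (3, 3), plus the injected (1, 3).
n = 500
data = np.vstack([
    rng.normal(1.0, 0.05, size=(n, 2)),
    rng.normal(3.0, 0.05, size=(n, 2)),
    [[1.0, 3.0]],
])

# Round each coordinate to the nearest of {1, 3}, then group by the pair.
rounded = np.where(data < 2, 1, 3).tolist()
counts = Counter(map(tuple, rounded))
# (1, 1) and (3, 3) each get ~500 members; (1, 3) gets exactly one.

# Groups with a single member are outlier candidates.
outliers = [group for group, c in counts.items() if c == 1]
print(outliers)  # [(1, 3)]
```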
Now let's say I want to do this in many dimensions: say 10^6 rows and 1000 columns.
All the above methods fail, either in principle or due to computational cost.
For example, density estimation methods are only practical up to roughly 10 dimensions; Euclidean distance is both expensive and nearly meaningless for so many dimensions; and partitioning the space into groups very quickly produces far too many "cubes", since their number grows exponentially with dimension.
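The loss of distance contrast can be illustrated quickly: for random points, the spread of pairwise distances shrinks relative to their magnitude as dimension grows (the sample size and dimensions below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=200):
    """Relative spread (max - min) / min of distances from one query
    point to n random points in the unit cube of the given dimension."""
    points = rng.random((n, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return (d.max() - d.min()) / d.min()

# Contrast collapses as the dimension grows: in 1000D, the nearest and
# farthest points are almost equally far away.
contrast = {dim: distance_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, c in contrast.items():
    print(dim, round(c, 3))
```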
Even PCA will most probably get rid of only part of the columns, and even if the dimensionality is reduced significantly, outliers may be missed precisely because of the approximation of throwing those columns out.
With clustering methods, I would have to choose the number of clusters, which I don't know in advance - and they will probably fail for exactly the same reasons anyway.
Any ideas?