5

I have a data set with a high-dimensional feature space. Are there any pre-processing methodologies that can detect outliers in this data set? By outliers I mean points that are very different from the other data points and may have been generated by improper measurements.

chl
bit-question

3 Answers

9

(Classical) Mahalanobis distances cannot be used to find outliers, because the Mahalanobis distances are themselves sensitive to outliers: by construction, the squared distances always sum to $(n-1)\times p$, the product of the dimensions of your data set, which means that gross outliers inflate the covariance estimate and mask one another rather than stand out.
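If you want to see this for yourself, here is a quick numerical check (simulated data, arbitrary sizes); the sum comes out to $(n-1)\times p$ no matter how many outliers you plant in the data:

```r
# Quick check: squared classical Mahalanobis distances always sum to (n - 1) * p.
# The data below are simulated purely for illustration.
set.seed(42)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), n, p)
X[1:20, ] <- X[1:20, ] + 10                # plant some gross outliers

d2 <- mahalanobis(X, colMeans(X), cov(X))  # squared classical distances
sum(d2)                                    # (n - 1) * p = 1990
```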

The recommended solution depends on what you mean by 'high dimensional'. Denoting $p$ the number of variables and $n$ the number of observations,

Broadly, for $10\leq p \leq 100$, the current state-of-the-art approach for outlier detection in the high-dimensional setting is the OGK algorithm, of which many implementations exist.

The numerical complexity of the OGK in $p$ is $\mathcal{O}(np^2+p^3)$, very close to the theoretical lower bound for this problem, because methods based on convex loss functions cannot reliably detect outliers if there are more than $n/p$ of them.
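As a rough sketch of how one might use it (I am assuming the `CovOgk()` implementation in the R package `rrcov` and its accessors; the 97.5% chi-square cutoff is a convention, not a requirement):

```r
# Sketch: flag outliers via robust Mahalanobis distances based on the OGK
# estimate of location and scatter (assumes the rrcov package).
library(rrcov)

flag_outliers_ogk <- function(X, level = 0.975) {
  fit <- CovOgk(X)                                    # OGK location and scatter
  d2  <- mahalanobis(X, getCenter(fit), getCov(fit))  # squared robust distances
  which(d2 > qchisq(level, df = ncol(X)))             # indices of suspect points
}
```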

If your number of dimensions is much larger than a few hundred, you will first have to perform a (robust) dimension reduction of your data set.

Here, the current state-of-the-art method is the PCA-grid approach, which, again, is well implemented; see also here for direct estimation of the covariance matrix.
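For illustration, a minimal sketch assuming the `PCAgrid()` function in the `pcaPP` R package (the number of components and the choice of robust scale are arbitrary here):

```r
# Sketch: robust dimension reduction by projection-pursuit PCA (PCA-grid),
# assuming the pcaPP package.
library(pcaPP)

pc     <- PCAgrid(X, k = 5, method = "mad")  # 5 robust components, MAD as scale
scores <- pc$scores                          # robust scores for downstream outlier detection
```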

An alternative procedure is the ROBPCA approach, which is, again, well implemented.
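A minimal sketch, assuming the `PcaHubert()` function in the `rrcov` R package mentioned in the comments below (accessor and plot behaviour are as I recall rrcov's API; the number of components is illustrative):

```r
# Sketch: ROBPCA via rrcov::PcaHubert; observations flagged FALSE fall outside
# the robust score/orthogonal-distance cutoffs and are outlier candidates.
library(rrcov)

rob <- PcaHubert(X, k = 5)   # robust PCA retaining 5 components
which(!getFlag(rob))         # indices of flagged observations
plot(rob)                    # outlier map (score vs. orthogonal distance)
```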

Notice that none of the three procedures requires $n$ to be larger than $p$. The last two methods have complexity of order $\mathcal{O}(nk^3)$, where $k$ is the number of components one wishes to retain (which in high-dimensional data is often much smaller than $p$).

Relative merits:

The big advantage of ROBPCA is its computational efficiency in high dimensions.

The big advantage of PCA-grid is that it can perform both robust estimation and hard variable selection (in the sense of giving exactly 0 weight to a subset of the variables); see here for a link and the sPCAgrid() function in the pcaPP R package.
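As a sketch of that hard variable selection (assuming `pcaPP`; the penalty `lambda` below is an arbitrary illustrative value, with larger values zeroing out more variables):

```r
# Sketch: sparse robust PCA with pcaPP::sPCAgrid; the penalty lambda drives a
# subset of the loadings to exactly zero (hard variable selection).
library(pcaPP)

sp <- sPCAgrid(X, k = 3, lambda = 1, method = "mad")
round(unclass(sp$loadings), 3)   # note the exact zeros in the loadings matrix
```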

user603
  • Excellent answer, just one point to make: Using the word "obvious" in contexts like this seems to me a mistake. If it is obvious to a reader, then the reader knows it is obvious. If it is NOT obvious to a reader, then he/she may feel insulted. I think words like "obvious" should be deleted from our "teaching vocabulary". – Peter Flom Jan 12 '12 at 12:30
  • @Peter Flom: corrected. (Sorry for the "obvious"; I was a bit ticked by the answer suggesting using the Mahalanobis distances -- 'twas before morning coffee.) – user603 Jan 12 '12 at 13:28
  • The first link to ROBPCA is not working. – mpiktas Jan 12 '12 at 13:45
  • Also the link to the R package which implements ROBPCA is outdated. The current version of the `rrcov` package is 1.3 and the function implementing ROBPCA is now called `PcaHubert`. +1 for the answer though, very good overview. – mpiktas Jan 12 '12 at 13:58
  • http://dx.doi.org/10.1007%2F978-3-540-69497-7_27 is a weighted-PCA also meant to be less sensitive to outliers. It's coming from a data mining perspective; so it tends to use a different language than the classic statisticians. But it's fairly low-cost over the regular PCA and can still improve things in real world data. – Has QUIT--Anony-Mousse Jan 14 '12 at 13:06
  • @Anony-Mousse: would you have a link to an implementation? I'd like to take that algorithm out for a drive. – user603 May 27 '14 at 16:18
  • @user603 I believe it is available in ELKI. – Has QUIT--Anony-Mousse May 27 '14 at 18:47
1

A basic approach is to use the Mahalanobis distance and look for data that are more extreme than you would expect. The Mahalanobis distance is a multidimensional measure that takes into account the pattern of intercorrelations amongst your variables. One issue to bear in mind is that this assumes your data are continuous; I don't think it would be appropriate if you had a discrete variable (e.g., male vs. female).

As for what is more extreme than you would expect, that depends in part on your sample size. For example, a lot of people think you should trim data that are more than 2 z's, but realistically, if you have 100 data points, it's reasonable to expect that about 5 will be beyond that threshold without anything being amiss.
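A rough sketch of the idea in R (the chi-square quantile is one common way to set "more extreme than you would expect"; the 0.975 level is only a convention, and it's worth inspecting whatever gets flagged):

```r
# Sketch: classical Mahalanobis distances with a chi-square cutoff; for clean,
# roughly multivariate-normal data, d2 is approximately chi-square on p df.
d2       <- mahalanobis(X, colMeans(X), cov(X))
suspects <- which(d2 > qchisq(0.975, df = ncol(X)))
```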

It's also worth your while to think hard about your variables without (i.e., before) looking at your data. For example, it's often the case that you can simply intuit the proper range of values. (If you have a person whose height is listed as 7'10", that's probably a typo.)

gung - Reinstate Monica
1

It might be worth looking at the one-class SVM, which attempts to construct the smallest hypersphere that encloses as large a proportion of the data as possible. It has been used successfully in the machine learning community for novelty detection, so its use for outlier detection seems reasonable. I suspect there are non-linear versions (it being a kernel method), in which case the sphere in the kernel-induced feature space corresponds to the boundary of the region containing the data in the attribute space, which doesn't necessarily have to be hyper-spherical. Caveat: I've seen this method described in conference and journal papers, but have not used it in anger myself, so your mileage may vary; it is probably worth looking into, though.
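A minimal sketch, assuming the `svm()` interface in the `e1071` R package (kernel and `nu` are illustrative tuning choices; `nu` roughly bounds the fraction of points the model is allowed to treat as outliers):

```r
# Sketch: one-class SVM for outlier/novelty detection, assuming e1071::svm.
library(e1071)

fit    <- svm(X, type = "one-classification", kernel = "radial", nu = 0.05)
inlier <- predict(fit, X)   # TRUE = inside the learned region
which(!inlier)              # candidate outliers
```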

Dikran Marsupial