
I would like to use PCA for anomaly detection, but I'm unsure exactly how this is done (I'm using `prcomp` in R).

I'm really asking about the approach, not the R code itself. Am I right in thinking that I first run PCA on a set of data to find a lower-dimensional subspace representation using the first $k$ PCs? Then, as new data becomes available, I reconstruct it using those $k$ PCs and examine the reconstruction error. If the error blows up, I know the new sample doesn't have the same 'structure' as the data used to build the PCs, and is therefore different in some way, i.e. an anomaly.

Can someone tell me if I'm in the right ballpark with my assumption?

amoeba
PaulB.
  • I think this sounds about right, yes. – amoeba Feb 04 '17 at 22:50
  • Reconstruction error is often referred to as the 'residual'. That term is common in the context of PCA-based anomaly detection, so searching for it may help you find more resources. – ReneBt Jul 31 '19 at 10:31

1 Answer


Yes, you can do this. The reconstruction error is the squared Euclidean distance between a new point and its projection onto the subspace found by PCA. It gives large values for outliers along directions orthogonal to the retained principal axes (point 1 in the example below), but not for outliers that lie along them (point 2). Insensitivity to this second kind of outlier may be desirable or undesirable, depending on your application. The reconstruction error is a continuous value, so you'd also need a way to choose the threshold above which a point counts as an outlier/anomaly.
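To make the distance being measured explicit, here is one way to write it down. The notation is an assumption, not taken from the question: $\mu$ is the mean of the training data, $W_k$ is the $p \times k$ matrix whose orthonormal columns are the first $k$ principal axes, and $x$ is a new point (scaled the same way as the training data, if scaling was used).

$$
\hat{x} = \mu + W_k W_k^\top (x - \mu), \qquad
e(x) = \lVert x - \hat{x} \rVert^2 = \lVert x - \mu \rVert^2 - \lVert W_k^\top (x - \mu) \rVert^2
$$

The second form shows where the insensitivity comes from: moving $x$ within the subspace increases both terms on the right by the same amount, so $e(x)$ does not change.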

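[Figure: example showing point 1, an outlier in a direction orthogonal to the principal axes, and point 2, an outlier along them]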
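As a rough illustration of the two kinds of outlier, here is a minimal sketch in Python with NumPy (not R's `prcomp`). Everything in it is an assumption made for the example: the synthetic 2-D data, the helper names `fit_pca` and `reconstruction_error`, and especially the thresholding rule (the 99th percentile of the training errors), which is just one possible choice.

```python
import numpy as np

def fit_pca(X, k):
    """Return the training mean and the first k principal axes (as columns)."""
    mu = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T

def reconstruction_error(X_new, mu, W):
    """Squared distance from each row to its projection onto the PCA subspace."""
    Xc = np.atleast_2d(X_new) - mu
    residual = Xc - Xc @ W @ W.T
    return np.sum(residual ** 2, axis=1)

rng = np.random.default_rng(0)
# Synthetic, strongly correlated 2-D training cloud (illustrative only).
t = rng.normal(size=500)
X_train = np.column_stack([t, t + 0.3 * rng.normal(size=500)])

mu, W = fit_pca(X_train, k=1)
train_err = reconstruction_error(X_train, mu, W)
# Assumed rule: flag anything above the 99th percentile of training errors.
threshold = np.quantile(train_err, 0.99)

axis = W[:, 0]
normal = np.array([-axis[1], axis[0]])  # unit vector orthogonal to the axis
point_1 = mu + 2.0 * normal   # displaced orthogonally to the first axis
point_2 = mu + 10.0 * axis    # displaced much further, but along the axis
err_1, err_2 = reconstruction_error(np.vstack([point_1, point_2]), mu, W)

print(err_1 > threshold, err_2 > threshold)  # True False
```

Point 2 sits five times further from the mean than point 1, yet only point 1 is flagged, because the error only measures the component orthogonal to the retained axis. If you work in R, the corresponding pieces of a `prcomp` fit would be its `center` and `rotation` components.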

user20160