How to identify outliers and conduct robust PCA?

Question

I want to conduct a principal component analysis (PCA) in SPSS. One assumption for PCA is that there are no significant outliers. How can I identify outliers in SPSS?

I don't think there is any such "assumption". PCA is just a multivariate transformation. How well it works for your purposes may be affected by whether outliers are present, but that depends on your purposes as well as the data. But you are right that checking for outliers is sensible data analysis. I haven't used SPSS for some decades, so can't fill in myself. — Nick Cox, Oct 04 '13 at 09:28
Outliers can completely screw a classical PCA analysis and make it meaningless (e.g. yield a model with an arbitrarily bad fit to the genuine part of the data). You can see an example [here](http://stats.stackexchange.com/a/44263/603). There is a discussion of robust methods for PCA estimation [here](http://stats.stackexchange.com/a/33602/603). I'm not aware of SPSS implementations, but many user-friendly ones are available in [R](http://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf). I would not recommend doing serious statistical analysis in a closed-source package such as SPSS. — user603, Oct 04 '13 at 09:59
@user603, with respect to your last recommendation: shall we write any _serious_ document in OpenOffice and not MS Office (the latter being closed source), either? — ttnphns, Oct 04 '13 at 10:21
@ttnphns the level of complexity (and transparency) of the canonical spreadsheet --keeping track of expenses in a small business/home setting say-- is just not comparable with that of a modern data analysis task (involving cutting edge techniques such as outlier identification). I've seen large organizations push spreadsheet complexity beyond the boundary of the canonical spreadsheet, but those stories usually end in tears. — user603, Oct 04 '13 at 10:24
@ttnphns: Surely *serious* documents are written using LaTeX. — Scortchi - Reinstate Monica, Oct 04 '13 at 12:05
@user603: Does PCA mean anything & does it yield models? Genuine question - like Nick I'm used to thinking of it as just a multivariate transformation, but perhaps there's another view, PCA as something akin to factor analysis. — Scortchi - Reinstate Monica, Oct 04 '13 at 12:08
@Scortchi: without pondering on this debate, I'm wondering how someone is to use a data analysis tool --for *any* purpose-- when its results are liable to being swayed by a few observations, regardless of the sample size --putting the fitted eigenvectors so far away from what they would have been without these data points as to make classical diagnostic tools completely useless. It's not like we don't have many real data examples of this kind of phenomenon... — user603, Oct 04 '13 at 12:15
@user603 As usual, I expect we (e.g. you, Scortchi and I) agree more closely than is apparent. Outliers can certainly make a PCA problematic or at least difficult to interpret, even though in principle, outliers can also be consistent with the correlation structure of the rest of the data. In practice, I might recommend PCA on transformed scales if outliers appeared to sway a PCA; robust estimation of PCA is certainly an alternative, but (I'm guessing wildly) a bit out of the OP's usual territory and unlikely to be available in SPSS. — Nick Cox, Oct 04 '13 at 13:02
Usually _serious_ documents are distinguished by what you can read from them — FairMiles, Oct 04 '13 at 14:11
For the PCA thing, it mostly depends on what you are using the PCA for and how you intend to interpret or use its results. Dimensionality reduction? Finding correlation patterns among variables? Discovering "hidden/latent" unmeasured variables? Significance test of groups of samples on PCA components? A single observation far from the "cloud" will tend to rule the first component (PCA tries to maximize heterogeneity of samples on each successive dimension) and that may be just fine or cause trouble... — FairMiles, Oct 04 '13 at 14:19
@FairMiles: and what if you have more than "a single observation far from the main cloud"? Nick Cox: "even though in principle, outliers can also be consistent with the correlation structure of the rest of the data" why take the chance when you don't *have* to? Also in PCA, the effect of the outliers doesn't depend on their correlation structure but on their distance from the main body of the data... — user603, Oct 04 '13 at 16:22
@user603: And how many is too many? I know this is throwing the ball out of the court, but then we should start defining outliers and what makes _genuine_ some part of the data whose multivariate description you don't want to disturb or see become _meaningless_... — FairMiles, Oct 07 '13 at 22:04
@FairMiles: surprisingly, it's possible to be very precise about these things. The answer to your question is one: a *single* outlier can be detected on diagnostic plots of the residuals of a method based on LS. For any convex loss function, the maximum number of outliers that can be detected is $d+1$ (where $d$ is the dimensionality of your dataset). Don't hesitate to ask a question about these topics. — user603, Oct 08 '13 at 09:15
@user603: Oh, I see... you mean how some tools precisely identify an outlier after having precisely defined what an outlier is by using the practical criterion (i.e., tool design) to justify why it should/must/can not belong to a _genuine_ dataset. Yes, I know little about that, may ask eventually. I was rambling about knowing/deciding if the PCA result that OP wants in some particular case should include some multivariate points in the dataset and if SPSS should decide (e.g., http://stats.stackexchange.com/questions/15497). BTW, I consider PCA mostly a descriptive tool, maybe is that... — FairMiles, Oct 08 '13 at 19:20

score 7 · Answer 1 · answered Oct 04 '13 at 19:30

Robust PCA is a very active research area, and identifying and removing outliers in a sound way is quite delicate. (I've written two papers in this field, so I do know a bit about it.) While I don't know SPSS, you may be able to implement the relatively simple Algorithm (1) here.

This algorithm (not mine) has rigorous guarantees but requires only some basic computations and a "while" loop. Assuming you are searching for $d$ principal components, the basic procedure is

1. Compute PCA on your data.
2. Project your data onto the top $d$ principal components.
3. Throw away "at random" one of the data points whose projection is "too large".
4. Repeat this "a few" times.

Everything in quotation marks is a heuristic; you can find the details in the paper.

The idea behind this procedure is that vectors whose projection after PCA is large may have affected the estimate too much, and so you may want to throw them away. It turns out that choosing the ones to throw away "at random" is actually a reasonable thing to do.
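For readers who would rather see the loop than read it, here is a minimal Python/NumPy sketch of the four steps above. It is an illustration under stated assumptions, not the paper's exact Algorithm 1: the function name `random_removal_pca`, the one-time centering by the coordinate-wise median, the removal probability proportional to the squared projection norm, and the use of squared projections in the "smallest projections" score are all my choices for the sketch. Check the paper for the precise definitions before relying on it.

```python
import numpy as np

def random_removal_pca(X, d, n_iter=None, t_hat=None, seed=0):
    """Sketch of PCA with random removal of points that have large projections.

    X: (n, p) data, rows are observations. d: number of components.
    n_iter: maximum number of removals (defaults to n - 1).
    t_hat: assumed lower bound on the number of good points (defaults to n // 2).
    Returns the best (p, d) component matrix found and the indices still kept.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    n_iter = n - 1 if n_iter is None else n_iter
    t_hat = n // 2 if t_hat is None else t_hat

    # Assumption of this sketch: center once with the coordinate-wise median.
    Z = X - np.median(X, axis=0)
    keep = np.arange(n)
    best_W, best_score = None, -np.inf

    for _ in range(n_iter):
        if keep.size <= d:
            break
        # Step 1: PCA on the points that are still kept (via SVD).
        _, _, vt = np.linalg.svd(Z[keep], full_matrices=False)
        W = vt[:d].T
        # Step 2: squared norm of each kept point's projection onto W.
        energy = np.sum((Z[keep] @ W) ** 2, axis=1)
        # Score this candidate by the sum of the t_hat smallest squared
        # projections over all n points, and remember the best one so far.
        score = np.sort(np.sum((Z @ W) ** 2, axis=1))[:t_hat].sum()
        if score > best_score:
            best_W, best_score = W, score
        total = energy.sum()
        if total <= 0:
            break
        # Step 3: drop one point at random, favoring large projections.
        drop = rng.choice(keep.size, p=energy / total)
        keep = np.delete(keep, drop)  # Step 4: loop again
    return best_W, keep
```

Calling `W, keep = random_removal_pca(X, d=2)` returns an orthonormal basis for the selected components and the indices of the points that survived the removals. The loop keeps the "best solution so far" rather than the last one, because late iterations may start discarding good points.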

If anyone actually wants to take the time to write the SPSS code for this, I'm sure @cathy would appreciate it.

Mike McCoy

+1, and nice of you to share the article! It might be indeed worth writing the code. The article is quite mathematical though (not easy to understand by data analysts like me). Can you point me to the main places there, particularly corresponding to the 4 stages of the algo you describe? – ttnphns Oct 04 '13 at 20:19
@ttnphns Algo 1 is at the top of p. 6. The steps above summarize step 2 in the algorithm. As for their notation, $\hat d$ is the number of principal components, $\bar{T}$ is the maximum number of loops (for safety, take $\bar{T}=n-1$, but smaller values usually won't hurt and you get the "best solution so far"), $\hat{t}$ is any lower bound on the number of good points (e.g., you can set $\hat t$ to be half the total number of points in most situations), $\bar{V}_{\hat{t}}(w)$ is the sum of the norms of the "smallest projections". – Mike McCoy Oct 04 '13 at 20:29
+1, but why not cite your own papers on the topic? I would be curious. – amoeba Jan 22 '15 at 17:24
@amoeba My papers (and a number of others) involve semidefinite programming, which can be rather laborious to explain. I wanted to give a simple-to-understand algorithm that is, in some sense, effective. – Mike McCoy Jan 23 '15 at 03:08
Do you know of a public implementation of this method? I've asked a separate question about it here. https://stats.stackexchange.com/questions/389204/open-implementation-of-xu-caramis-and-mannors-outlier-robust-pca – eric_kernfeld Jan 25 '19 at 22:11
