34

In principal component analysis (PCA), one can use either the covariance matrix or the correlation matrix to find the components (from their respective eigenvectors). These give different results (PC loadings and scores), because the eigenvectors of the two matrices are not equal. My understanding is that this is because a raw data vector $X$ and its standardized version $Z$ are not related by an orthogonal transformation. Mathematically, similar matrices (e.g. those related by an orthogonal transformation) have the same eigenvalues, but not necessarily the same eigenvectors.
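To make the disagreement concrete, here is a minimal R sketch (the data are simulated purely for illustration; any multivariate dataset would do):

    set.seed(1)
    x1 <- rnorm(100)
    x2 <- 0.5 * x1 + rnorm(100, sd = 5)   # correlated with x1, but much larger variance
    X  <- cbind(x1, x2)
    eigen(cov(X))$vectors   # principal directions from the covariance matrix
    eigen(cor(X))$vectors   # principal directions from the correlation matrix -- different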

This raises some difficulties in my mind:

  1. Does PCA actually make sense, if you can get two different answers for the same starting data set, both trying to achieve the same thing (=finding directions of maximum variance)?

  2. When using the correlation matrix approach, each variable is standardized (scaled) by its own individual standard deviation before the PCs are calculated. How, then, does it still make sense to find the directions of maximum variance, if the data have already been scaled/compressed differently beforehand? I know that correlation-based PCA is very convenient (standardized variables are dimensionless, so their linear combinations can be added; other advantages are also based on pragmatism), but is it correct?

It seems to me that covariance based PCA is the only truly correct one (even when the variances of the variables differ greatly), and that whenever this version cannot be used, correlation based PCA should not be used either.

I know that there is this thread: PCA on correlation or covariance? -- but it seems to focus only on finding a pragmatic solution, which may or may not also be an algebraically correct one.

ttnphns
Lucozade
  • 6
    I'm going to be honest and tell you I quit reading your question at some point. PCA makes sense. Yes, the results may be different depending on whether you choose to use the correlation or variance/covariance matrix. Correlation-based PCA is preferred if your variables are measured on different scales, but you don't want this to dominate the outcome. Imagine you have a series of variables that range from 0 to 1, and then some that have very large values (relatively speaking, like 0 to 1000): the large variance associated with the second group of variables will dominate. – Patrick Jun 27 '13 at 00:30
  • I changed the title, to mark the difference with previous questions on the topic. I hope the new title is OK. – Gala Jun 27 '13 at 07:09
  • 1
    @Patrick: (1) please read the full question before answering, as a courtesy & generally sensible approach. (2) Your example illustrates the point: if I convert the [0,1000] interval to dBA or any log scale, the data now range from $-\infty$ to 30, i.e., the values originally close to zero (say, 0.001) are stretched and end up much further from the new (log) center than does the original 1000. Scaling (including dividing by individual s.d.) enables data points -- particularly outliers -- to be moved almost anywhere. This is the case even if all variables are measured on the same scale. – Lucozade Jun 27 '13 at 09:23
  • 4
    But that's the case with many other techniques as well and I think Patrick's point is reasonable. Also it was merely a comment, no need to become aggressive. Generally speaking, why would you assume that there should be one true “algebraically” correct way to approach the problem? – Gala Jun 27 '13 at 09:55
  • 1
    @Gala: Because both approaches claim to solve the same problem (see pt. 1 of my answer to ttnphns). Moreover, in e.g. linear regression, there is a set of specific conditions that must be satisfied to be able to use the method. Between cov-PCA and corr-PCA, I have not yet seen (a) clear rule(s) or division stating when each of these should/should not be applied, how the two methods diverge/converge under which conditions, etc. PS: I did not intend any aggression; on the contrary. Perhaps this rather applies to anyone who writes "I quit reading your question", but still comments nevertheless. – Lucozade Jun 27 '13 at 11:17
  • 6
    Perhaps you're thinking of PCA in the wrong way: it's just a transformation, so there's no question of its being correct or incorrect, or relying on assumptions about the data model - unlike, say, regression or factor analysis. – Scortchi - Reinstate Monica Jun 27 '13 at 11:37
  • 6
    The crux of this matter appears to rest on a misunderstanding of what standardization does and how PCA works. This is understandable, because a good grasp of PCA requires visualization of higher-dimensional shapes. I would maintain that this question, like many other questions based on some sort of misapprehension, is thereby a *good* one and ought to remain open, because its answer(s) can reveal truths that many people might not have fully appreciated before. – whuber Jun 27 '13 at 14:36
  • 7
    PCA does not “claim” anything. People make claims about PCA and in fact use it very differently depending on the field. Some of these uses might be silly or questionable but it does not seem very enlightening to assume that a single variant of the technique must be the “algebraically correct” one with no reference to the context or goal of the analysis. – Gala Jun 27 '13 at 20:54

3 Answers

33

I hope these responses to your two questions will calm your concern:

  1. A correlation matrix is the covariance matrix of the standardized (i.e. not just centered but also rescaled) data; that is, it is the covariance matrix (as if) of another, different dataset. So it is natural, and it shouldn't bother you, that the results differ (see the short R check after this list).
  2. Yes, it makes sense to find the directions of maximal variance in the standardized data - they are the directions of - so to speak - "correlatedness," not "covariatedness"; that is, the directions that remain after the effect of the original variables' unequal variances on the shape of the multivariate data cloud has been taken off.
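Here is a quick R check of point 1 (a minimal sketch; X is simulated here, but any numeric data matrix would do):

    set.seed(1)
    X <- matrix(rnorm(200), ncol = 2) %*% matrix(c(1, 0.6, 0, 2), nrow = 2)
    Z <- scale(X)          # center each column and rescale it to unit variance
    all.equal(cov(Z), cor(X), check.attributes = FALSE)   # TRUE

So correlation-based PCA is simply covariance-based PCA applied to the standardized dataset Z.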

The next text and pictures were added by @whuber (I thank him; also, see my comment below).

Here is a two-dimensional example showing why it still makes sense to locate the principal axes of standardized data (shown on the right). Note that in the right hand plot the cloud still has a "shape" even though the variances along the coordinate axes are now exactly equal (to 1.0). Similarly, in higher dimensions the standardized point cloud will have a non-spherical shape even though the variances along all axes are exactly equal (to 1.0). The principal axes (with their corresponding eigenvalues) describe that shape. Another way to understand this is to note that all the rescaling and shifting that goes on when standardizing the variables occurs only in the directions of the coordinate axes and not in the principal directions themselves.

[Figure: a two-dimensional point cloud, raw (left) and standardized (right), with its principal axes shown.]

What is happening here is geometrically so intuitive and clear that it would be a stretch to characterize this as a "black-box operation": on the contrary, standardization and PCA are some of the most basic and routine things we do with data in order to understand them.
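The same point can be checked numerically; here is a minimal R sketch (simulated data): after standardization the variance along each coordinate axis is exactly 1, yet the eigenvalues describing the cloud's shape remain unequal.

    set.seed(1)
    x1 <- rnorm(300)
    x2 <- x1 + rnorm(300, sd = 0.5)   # a strongly correlated pair
    Z  <- scale(cbind(x1, x2))        # both variances are now exactly 1
    diag(cov(Z))                      # 1 1
    eigen(cov(Z))$values              # still unequal, e.g. roughly 1.9 and 0.1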


Continued by @ttnphns

When would one prefer to do PCA (or factor analysis or another similar type of analysis) on correlations (i.e. on z-standardized variables) instead of doing it on covariances (i.e. on centered variables)?

  1. When the variables are different units of measurement. That's clear. (A short R demonstration follows this list.)
  2. When one wants the analysis to reflect just and only linear associations. Pearson's $r$ is not only the covariance between the uniscaled (variance = 1) variables; it is also a measure of the strength of linear relationship, whereas the usual covariance coefficient is receptive to both linear and monotonic relationships.
  3. When one wants the associations to reflect relative co-deviatedness (from the mean) rather than raw co-deviatedness. Correlation is based on distributions and their spreads, while covariance is based on the original measurement scale. If I were to factor-analyze patients' psychopathological profiles as assessed by psychiatrists on some clinical questionnaire consisting of Likert-type items, I'd prefer covariances, because the professionals are not expected to distort the rating scale intrapsychically. If, on the other hand, I were to analyze the patients' self-portraits obtained with that same questionnaire, I'd probably choose correlations, because a layman's assessment is expected to be relative to "other people", "the majority", "permissible deviation", or a similar implicit das Man loupe which "shrinks" or "stretches" the rating scale for one.
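A minimal R sketch for point 1 (simulated data; one variable is arbitrarily re-expressed in other units to mimic a change of measurement scale):

    set.seed(1)
    a  <- rnorm(100)
    b  <- 0.5 * a + rnorm(100)
    X  <- cbind(a, b)
    X2 <- cbind(a, b = b * 1000)          # same data, b merely re-expressed in other units
    prcomp(X)$rotation                    # covariance-based loadings...
    prcomp(X2)$rotation                   # ...change completely with the change of units
    prcomp(X,  scale. = TRUE)$rotation    # correlation-based loadings...
    prcomp(X2, scale. = TRUE)$rotation    # ...are unchanged (apart from possible sign flips)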
ttnphns
  • 1
    1. Sorry, but this bothers me a lot. To an external individual, the standardization is a black-box operation, part of the PCA pre-conditioning of the data (also in ICA). He wants one answer for his (raw) input data, especially if it relates to physical (dimensioned) data for which the PCA output needs to be interpreted physically (i.e., in terms of the unstandardized variables) as well. – Lucozade Jun 27 '13 at 09:29
  • 2. My understanding is that PCA maximizes variance (Jolliffe, p. 2); covariance and correlation (do they have directions??) are not a primary concern or target; they are removed by diagonalization of the correlation/covariance matrix anyway. If you take away the inequality of variances that *defines* the shape of the cloud, how can one still claim to find its direction(s) of maximum extent? – Lucozade Jun 27 '13 at 09:37
  • `PCA maximizes variance; covariance and correlation... are not a primary concern` PCA maximizes the overall _multivariate_ variance, i.e. variance + covariance. The shape of a data cloud lying in concrete dimensions (the variables) is described ("defined") by the variance-covariance matrix. If the variances along those dimensions are forced to be all equal, the shape changes, but it can still remain an ellipsoid worth PC-analysing. – ttnphns Jun 27 '13 at 14:34
  • @Lucozade: point taken, and I am sorry for rudely not reading your question fully. I am not sure why standardization of variables is a "black box operation", as it is simply centering and rescaling. Would you be willing to offer an example of how a correlation-based PCA could be difficult to interpret relative to a covariance-based PCA? It seems to me that if you have issues with the standardization, you have the choice not to standardize. The first step when considering which PCA to use is whether or not it is appropriate/required to standardize. – Patrick Jun 27 '13 at 20:35
  • Not sure if you use R, but this is a decent example, I think, of why/when standardizing is necessary: `data(mtcars); biplot(prcomp(mtcars)); biplot(prcomp(mtcars, scale. = TRUE)); head(mtcars) # note the relatively large values for "disp" and "hp"` – Patrick Jun 27 '13 at 20:36
  • `prcomp(mtcars)$x[, "PC1"]` – Patrick Jun 27 '13 at 20:56
  • @Patrick: in short, my application is about finding a set of optimum spatial positions of a machine, whose performance is measured in terms of the voltage that each position produces under different observations (parameter variations). The data have a large dynamic range (noise-like), though the values are not considered outliers. If I run corr_PCA (= on standardized voltages), it recommends different positions from those of cov_PCA (= on centered voltages). I checked both solutions; cov_PCA gives better performance than corr_PCA. If I unstandardize the corr_PCA solutions, I get the cov_PCA ones. – Lucozade Jun 28 '13 at 10:13
  • Thank you all for your comments and inputs; very inspiring. I have highlighted a third part, to steer the discussion back to the key issue (for me): corr_PCA vs. cov_PCA. @Gala: point taken; when I said "focus on variance", that indeed does not mean ignoring covariance. – Lucozade Jun 28 '13 at 10:15
  • 1
    Your latest revision appears to be a re-assertion that "covariance based PCA is the only truly correct one". As the entirety of the responses so far are in essence "No; wrong way to think about it; and here's why" it is difficult to know how you expect to steer discussion against such overwhelming disagreement. – Nick Cox Jun 28 '13 at 10:22
  • @ttnphns: thanks for the picture, which is indeed the classic way of looking at the effects of scaling on PCA (and on the slope coefficient in linear regression). But I respectfully disagree with your text: (i) each individual PC direction should have full meaning on its own, without combining two or more to get a shape. E.g., if you are only interested in one dominant PC, you are looking for a direction, not an ellipse. (ii) If scaling acts in the directions of the coordinate axes, it also affects the PC directions themselves, because the coordinates influence the PC loadings. – Lucozade Jun 28 '13 at 10:24
  • @Nick: no, I am well prepared to be proven wrong. But I am looking for a (to me) convincing argument or demonstration that corr_PCA really does/can give an optimum solution, over and above that obtained from cov_PCA -- or indeed vice versa. I know how to convert between the two solutions. – Lucozade Jun 28 '13 at 10:34
  • 4
    @Lucozade: I was confused about your description of your application:- How is PCA _recommending_ anything? How did you measure _performance_? Similarly for your last comment:- The _optimum_ for what? – Scortchi - Reinstate Monica Jun 28 '13 at 11:56
  • 5
    @Lucozade: Indeed, listen please what Scortchi said, you seem to continue chasing down spooks. PCA is simply a special form of rotating data in space. It always does optimally what it does with the input data. The cov-corr dilemma is a pragmatic one, rooted in data pre-processing and being solved at that level, not at the PCA level. – ttnphns Jun 28 '13 at 12:25
  • 1
    @Lucozade: It would be my (non-expert) opinion, based on your reply to me, that for your specific need you are right to want cov-based PCA. Again, your variables are all homogeneous in terms of data/measurement type (same machine type, and all data in volts). To me your example is clearly a case where cov-PCA is correct, but please note that this is not always the case, and I think this is the important point of this whole thread (the choice of cor v. cov is case-specific and needs to be determined by the person who understands the data & application best). Good luck with your research! – Patrick Jun 28 '13 at 12:55
  • "How is PCA recommending anything?" I was thinking this same thing, my guess is the machines are ordered along PC1, and this is what Lucozade wants to use to "find an optimum spatial positioning for his machines" (paraphrased) – Patrick Jun 28 '13 at 13:15
  • Why is "unstandardization afterward mandatory"? Why would this be true for descriptive/visualization purposes with heterogeneous data types? – Patrick Jun 28 '13 at 19:08
6

Speaking from a practical viewpoint - possibly unpopular here - if you have data measured on different scales, then go with correlation ('UV scaling' if you are a chemometrician); but if the variables are on the same scale and their size matters (e.g. with spectroscopic data), then covariance (centering the data only) makes more sense. PCA is a scale-dependent method; a log transformation can also help with highly skewed data.
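For concreteness, here are the three pre-processing choices above sketched in R (the matrix X of positive, spectroscopy-like intensities is simulated purely for illustration):

    set.seed(1)
    X <- matrix(rexp(300, rate = 0.1), ncol = 3)   # positive, right-skewed values
    pca_cov <- prcomp(X)                  # centering only: covariance-based PCA
    pca_cor <- prcomp(X, scale. = TRUE)   # 'UV scaling': correlation-based PCA
    pca_log <- prcomp(log(X))             # log-transform first, for skewed data
    summary(pca_cov); summary(pca_cor); summary(pca_log)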

In my humble opinion, based on 20 years of practical application of chemometrics, you have to experiment a bit and see what works best for your type of data. At the end of the day, you need to be able to reproduce your results and try to prove the predictability of your conclusions. How you get there is often a case of trial and error, but the thing that matters is that what you do is documented and reproducible.

Nick Cox
mark
  • 5
    The practical approach you seem to advocate here boils down to - when both covariances and correlations are warranted - "try both and see what works best". That purely empirical stance masks the fact that each choice comes with its own assumptions or paradigm about the reality, which the researcher ought to be aware of in advance, even if he understands that he prefers one of them fully arbitrarily. Selecting "what works best" is capitalizing on the feeling of pleasure - narcomania. – ttnphns Jun 28 '13 at 23:26
-1

I have no time to go into a fuller description of the detailed and technical aspects of the experiment I described, and clarifications of wordings (recommending, performance, optimum) would again divert us from the real issue, which is what type of input data PCA can(not) / should (not) be taking.

PCA operates by taking linear combinations of numbers (the values of variables). Mathematically, of course, one can add any two (real or complex) numbers. But if they have been re-scaled before the PCA transformation, is their linear combination (and hence the process of maximization) still meaningful to operate on? If each variable $x_i$ has the same standard deviation $s$ (i.e., variance $s^2$), then clearly yes, because $(x_1/s)+(x_2/s)=(x_1+x_2)/s$ is still proportional and comparable to the physical superposition of the data, $x_1+x_2$, itself. But if $s_1 \neq s_2$, then the linear combination of the standardized quantities distorts the input variables to different degrees. There seems little point then in maximizing the variance of their linear combination. In that case, PCA gives a solution for a different set of data, in which each variable is scaled differently.

If you then unstandardize afterwards (when using corr_PCA), that may be OK and necessary; but if you just take the raw corr_PCA solution as-is and stop there, you obtain a mathematical solution, but not one related to the physical data. Since unstandardizing afterwards then seems mandatory as a minimum (i.e., 'unstretching' the axes by the inverse standard deviations), cov_PCA could have been used to begin with.

If you are still reading by now, I am impressed! For now, I finish by quoting from Jolliffe's book, p. 42, which is the part that concerns me: 'It must not be forgotten, however, that correlation matrix PCs, when re-expressed in terms of the original variables, are still linear functions of x that maximize variance with respect to the standardized variables and not with respect to the original variables.' If you think I am interpreting this or its implications wrongly, this excerpt may be a good focal point for further discussion.
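For concreteness, here is a minimal R sketch of the 'unstandardization' of the axes discussed above (simulated data; whether the rescaled axes match the covariance-based ones is precisely the point of contention in the comments below):

    set.seed(1)
    a <- rnorm(100)
    X <- cbind(a, b = 0.5 * a + rnorm(100, sd = 5))
    s <- apply(X, 2, sd)                      # per-variable standard deviations
    V_corr <- eigen(cor(X))$vectors           # axes for the standardized variables z
    V_back <- diag(1 / s) %*% V_corr          # the same axes re-expressed in terms of raw x
    V_back <- sweep(V_back, 2, sqrt(colSums(V_back^2)), "/")   # renormalize the columns
    V_cov  <- eigen(cov(X))$vectors
    round(V_back, 3); round(V_cov, 3)         # the directions generally differ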

Lucozade
  • The missing reference here is Jolliffe, I.T. 2002. _Principal component analysis._ New York: Springer. [various misspellings of the author's name are common in citations] – Nick Cox Jun 28 '13 at 20:10
  • 3
    It is so amusing that your own answer, which is in tune with everything people here have been trying to convey to you, still leaves you unsettled. You still argue `There seems little point` in PCA on correlations. Well, if you need to stay close to the raw data ("physical data", as you strangely call it), you really shouldn't use correlations, since they correspond to another ("distorted") dataset. – ttnphns Jun 28 '13 at 20:42
  • 2
    (Cont.) Jolliffe's citation states that PCs obtained on correlations will always remain themselves and cannot be turned "back" into PCs on covariances, even though you can re-express them as linear combinations of the original variables. Thus, Jolliffe stresses the idea that PCA results are fully dependent on the type of pre-processing used, and that there exist no "true", "genuine" or "universal" PCs... – ttnphns Jun 28 '13 at 20:43
  • 2
    (Cont.) And in fact, several lines below, Jolliffe speaks of yet another "form" of PCA - PCA on the `X'X` matrix. This form is even "closer" to the original data than cov-PCA, because no centering of the variables is done. And the results are usually [utterly different](http://stats.stackexchange.com/a/22331/3277). You could also do PCA on cosines. People do PCA on all versions of the [SSCP matrix](http://stats.stackexchange.com/a/22520/3277), albeit covariances or correlations are used most often. – ttnphns Jun 28 '13 at 20:52
  • 5
    Underlying this answer is an implicit assumption that the units in which data are measured have an intrinsic meaning. That is rarely the case: we may choose to measure length in Angstroms, parsecs, or anything else, and time in picoseconds or millennia, without altering the *meaning* of the data one iota. The changes made in going from covariance to correlation are merely changes of units (which, by the way, are particularly sensitive to outlying data). This suggests the issue is not covariance *versus* correlation, but rather *to find fruitful ways to express the data for analysis.* – whuber Jun 28 '13 at 21:45
  • @whuber: I can't help approving your wise remarks... with the exception of that hazy `going from covariance to correlation are merely changes of units`. It is "merely" this if all the variances are equal (and all the variables are in the same units); otherwise the implications are more profound (e.g. see my answer). – ttnphns Jun 28 '13 at 22:59
  • 3
    @ttnphns I'll stick by the "merely," thanks. Whether or not the implications are "profound," the fact remains that standardization of a variable literally is an affine re-expression of its values: a change in its units of measure. The importance of this observation lies in its implications for some claims appearing in this thread, of which the most prominent is "covariance-based PCA is the only truly correct one." Any conception of correctness that ultimately depends on an essentially *arbitrary* aspect of the data--how we write them down--cannot be right. – whuber Jun 30 '13 at 12:59
  • @whuber: the issue is that correlation PCA comes built in, by definition, with standardization from x to z as a forward transformation of the original data. However, the reverse transformation, which brings the PCs back to those of the pertinent x (and not z), is the vital - and often missing - final operation for correlation PCA. If this operation is added, the correlation PCA reduces to the covariance PCA. Geometrically, in terms of stretching and unstretching the hyperplane of best fit, this necessity is quite obvious. One cannot claim to have solved the problem for z alone if the data were given as x. – Lucozade Jul 01 '13 at 13:50
  • @whuber: *Underlying this answer is an implicit assumption that the units in which data are measured have an intrinsic meaning.* This may be assumed by some, but not by me. Rather, it is the PCA solution that is intrinsically linked to the units of the input data. If you take away that link by standardization, without re-expression in terms of the original units, what is the use and applicability of the solution? – Lucozade Jul 01 '13 at 13:59
  • @ttnphns: in my experience, one always starts a PCA analysis from the raw data. Cases where your input data of interest (dimensioned or dimensionless) are already centered and sphered seem pathological. – Lucozade Jul 01 '13 at 14:08
  • @ttnphns: *...if you need to stay close to raw data...*: to me, that seems always a necessity. Can you give an example where you have the liberty to transform your data and the liberty not to back-transform your solution to the original data & units? This would be like using the substitution method for solving an integral and not bothering to express the solution for the transformed variable back in terms of the original variable of integration. – Lucozade Jul 01 '13 at 14:21
  • 1
    "Geometrically [quite] obvious," maybe: but not true! The PCA for the correlation, when back-transformed to the original units, is *not* the PCA for the covariance. That would amount to a claim that orthogonal matrices commute with arbitrary diagonal matrices, which is easily refuted algebraically or by counterexample. – whuber Jul 01 '13 at 14:32
  • @ttnphns: *...and cannot be turned "back" into PCs on covariances... you can re-express them as linear combinations of the original variables*. This seems contradictory to me. Can you clarify what you mean here? I have shown using my own data that I can indeed retrieve the cov_PCs from the corr_PCs (based on an independent check), which is of course no surprise, because the raw data and the centered-and-standardized data are linked via a linear transformation, hence their PCs are also linked deterministically. – Lucozade Jul 01 '13 at 14:34
  • @Lucozade: Gosh! `I can indeed retrieve the cov_PCs from the corr_PCs` Please, go and show it as addendum in your answer. (Really, any one cov PC is the linear combination of the _entire set_ of corr PCs, and vice versa. So what?) – ttnphns Jul 01 '13 at 15:19