
A common thing to do in Principal Component Analysis (PCA) is to plot two loadings against each other to investigate the relationships between the variables. In the paper accompanying the pls R package for Principal Component Regression and PLS regression, there is a different plot, called the correlation loadings plot (see figure 7 and page 15 of the paper). The correlation loading, as explained there, is the correlation between the scores (from the PCA or PLS) and the actual observed data.

It seems to me that loadings and correlation loadings are pretty similar, except that they are scaled a bit differently. A reproducible example in R, with the built-in dataset mtcars, is as follows:

data(mtcars)
pca <- prcomp(mtcars, center=TRUE, scale=TRUE)

#loading plot
plot(pca$rotation[,1], pca$rotation[,2],
     xlim=c(-1,1), ylim=c(-1,1),
     main='Loadings for PC1 vs. PC2')

#correlation loading plot
correlationloadings <- cor(mtcars, pca$x)
plot(correlationloadings[,1], correlationloadings[,2],
     xlim=c(-1,1), ylim=c(-1,1),
     main='Correlation Loadings for PC1 vs. PC2')

[Figure: loadings plot and correlation loadings plot]

What is the difference in interpretation of these plots? And which plot (if any) is best to use in practice?

user1593755
  • for a better view of the PCA, use biplot(pca); it shows you the loadings and the scores of the PCA, so you can interpret it better. – Paul Jun 22 '14 at 21:07
  • R's `prcomp` recklessly calls eigenvectors "loadings". I [advise](http://stats.stackexchange.com/a/35653/3277) keeping these terms separate. Loadings are eigenvectors scaled up to the respective eigenvalues. – ttnphns Jun 23 '14 at 07:02
  • Explaining the geometry of a loading plot: http://stats.stackexchange.com/a/119758/3277 – ttnphns Oct 16 '14 at 07:25

1 Answer


Warning: R uses the term "loadings" in a confusing way. I explain it below.

Consider a dataset $\mathbf{X}$ with (centered) variables in columns and $N$ data points in rows. Performing PCA of this dataset amounts to the singular value decomposition $\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^\top$. Columns of $\mathbf{US}$ are principal components (PC "scores") and columns of $\mathbf{V}$ are principal axes. The covariance matrix is given by $\frac{1}{N-1}\mathbf{X}^\top\mathbf{X} = \mathbf{V}\frac{\mathbf{S}^2}{N-1}\mathbf{V}^\top$, so the principal axes $\mathbf{V}$ are eigenvectors of the covariance matrix.

"Loadings" are defined as columns of $\mathbf{L}=\mathbf{V}\frac{\mathbf S}{\sqrt{N-1}}$, i.e. they are eigenvectors scaled by the square roots of the respective eigenvalues. They are different from eigenvectors! See my answer here for motivation.

Using this formalism, we can compute the cross-covariance matrix between the original variables and the standardized PCs: $$\frac{1}{N-1}\mathbf{X}^\top(\sqrt{N-1}\mathbf{U}) = \frac{1}{\sqrt{N-1}}\mathbf{V}\mathbf{S}\mathbf{U}^\top\mathbf{U} = \frac{1}{\sqrt{N-1}}\mathbf{V}\mathbf{S}=\mathbf{L},$$ i.e. it is given by the loadings. The cross-correlation matrix between the original variables and the PCs is given by the same expression divided by the standard deviations of the original variables (by the definition of correlation). If the original variables were standardized prior to performing PCA (i.e. PCA was performed on the correlation matrix), these standard deviations are all equal to $1$, and the cross-correlation matrix is again given simply by $\mathbf{L}$.
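As a quick sanity check of this algebra, here is a minimal R sketch (using the mtcars data from the question; the variable names are only illustrative) verifying that $\mathbf{L}$ coincides with the matrix of correlations between the standardized variables and the PCs:

X   <- scale(mtcars)               # center and standardize -> PCA on the correlation matrix
pca <- prcomp(X)
N   <- nrow(X)

V <- pca$rotation                  # principal axes (eigenvectors)
S <- diag(pca$sdev * sqrt(N - 1))  # diagonal matrix of singular values

L <- V %*% S / sqrt(N - 1)         # loadings: eigenvectors scaled by sqrt(eigenvalues)

all.equal(unname(L), unname(cor(X, pca$x)))  # TRUE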

To clear up the terminological confusion: what R calls "loadings" are principal axes, and what the pls paper calls "correlation loadings" are (for PCA done on the correlation matrix) in fact loadings. As you noticed yourself, they differ only in scaling. Which one is better to plot depends on what you want to see. Consider the following simple example:

[Figure: biplots of a simple 2D example]

The left subplot shows a standardized 2D dataset (each variable has unit variance), stretched along the main diagonal. The middle subplot is a biplot: a scatter plot of PC1 vs. PC2 (in this case simply the dataset rotated by 45 degrees) with the rows of $\mathbf{V}$ plotted on top as vectors. Note that the $x$ and $y$ vectors are 90 degrees apart; they tell you how the original axes are oriented. The right subplot is the same biplot, but now the vectors show rows of $\mathbf{L}$. Note that now the $x$ and $y$ vectors have an acute angle between them; they tell you how strongly the original variables are correlated with the PCs, and both $x$ and $y$ are much more strongly correlated with PC1 than with PC2. I guess most people would most often prefer to see the right type of biplot.

Note that in both cases the $x$ and $y$ vectors have unit length. This happened only because the dataset was 2D to start with; when there are more variables, individual vectors can have length less than $1$, but they can never reach outside of the unit circle. I leave the proof of this fact as an exercise.
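To make this concrete, here is a rough R sketch of such a toy example (the exact dataset from the figure is not available, so a correlated 2D Gaussian is simulated; all names are illustrative), drawing the scores once with the rows of $\mathbf{V}$ and once with the rows of $\mathbf{L}$ as arrows:

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.8 * x + 0.3 * rnorm(n)
X <- scale(cbind(x = x, y = y))    # standardized 2D dataset

pca <- prcomp(X)
V   <- pca$rotation                # principal axes
L   <- V %*% diag(pca$sdev)        # loadings (= correlations with the PCs here)

par(mfrow = c(1, 2))
for (A in list(V, L)) {
  plot(pca$x, asp = 1, pch = 16, col = "grey", xlab = "PC1", ylab = "PC2")
  arrows(0, 0, A[, 1], A[, 2], col = "red", length = 0.1)
  text(A[, 1] * 1.1, A[, 2] * 1.1, rownames(A), col = "red")
}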

Let us now take another look at the mtcars dataset. Here is a biplot of the PCA done on the correlation matrix:

[Figure: mtcars PCA biplot, correlation matrix]

Black lines are plotted using $\mathbf{V}$, red lines are plotted using $\mathbf{L}$.
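For reference, here is a rough sketch (not the exact code behind the figure) of how the vector part of such a biplot can be drawn for mtcars, with the rows of $\mathbf{V}$ in black, the rows of $\mathbf{L}$ in red, and the unit circle that bounds the loadings:

pca <- prcomp(mtcars, scale. = TRUE)   # PCA on the correlation matrix
V <- pca$rotation[, 1:2]               # principal axes, first two PCs
L <- V %*% diag(pca$sdev[1:2])         # loadings (= variable-PC correlations here)

plot(0, 0, type = "n", xlim = c(-1, 1), ylim = c(-1, 1), asp = 1,
     xlab = "PC1", ylab = "PC2")
symbols(0, 0, circles = 1, inches = FALSE, add = TRUE)   # unit circle
arrows(0, 0, V[, 1], V[, 2], length = 0.1)               # black: rows of V
arrows(0, 0, L[, 1], L[, 2], length = 0.1, col = "red")  # red: rows of L
text(L[, 1], L[, 2], rownames(L), col = "red", cex = 0.7, pos = 3)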

And here is a biplot of the PCA done on the covariance matrix:

[Figure: mtcars PCA biplot, covariance matrix]

Here I scaled all the vectors and the unit circle by $100$, because otherwise they would not be visible (this is a commonly used trick). Again, black lines show rows of $\mathbf{V}$, and red lines show correlations between variables and PCs (which are not given by $\mathbf{L}$ anymore, see above). Note that only two black lines are visible; this is because two variables have very high variance and dominate the mtcars dataset. On the other hand, all red lines can be seen. Both representations convey some useful information.
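A sketch of how those red lines can be computed (again with mtcars; this follows the correlation recipe from above and is not the exact code behind the figure): take $\mathbf{L}$ from the covariance-matrix PCA and divide each row by the standard deviation of the corresponding variable:

pca_cov <- prcomp(mtcars)              # PCA on the covariance matrix (no scaling)
V <- pca_cov$rotation
L <- V %*% diag(pca_cov$sdev)          # cross-covariances with the standardized PCs

# divide each row (= variable) by that variable's standard deviation
R <- sweep(L, 1, apply(mtcars, 2, sd), "/")

all.equal(unname(R), unname(cor(mtcars, pca_cov$x)))   # TRUE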

P.S. There are many different variants of PCA biplots; see my answer here for some further explanations and an overview: Positioning the arrows on a PCA biplot. The prettiest biplot ever posted on CrossValidated can be found here.

amoeba
  • Although this is a very good answer (+1), it has just one didactic weakness: it initially puts variables in rows of X, not in columns of X as they would traditionally go in statistical datasets/examples. Because of that transpose, the U vectors in the answer come to be about variables and V about cases. Most people who know PCA are accustomed to the opposite layout, so it hinders perception a bit. – ttnphns Oct 16 '14 at 12:43
  • I might recommend stressing verbally the "moral" of the difference between the "axes biplot" and the "loadings biplot" on the scan. In the first, variability (=scale, =magnitude, =inertia, =mass) isn't presented: it is stored away in the eigenvalues. In the second, it was given up fully to the eigenvectors representing variables; by virtue of that "reviving", the variables become a meaningful data cloud of two points, or vectors, with specific lengths from the origin and a specific angle. This is how we "suddenly" find ourselves in _subject space_. – ttnphns Oct 16 '14 at 13:01
  • Thanks @ttnphns, both good points. Regarding rows/columns of $\mathbf X$: in fact, I prefer the layout I used. A single data point is usually written as a column vector $\mathbf x$. A matrix $\mathbf U$ acting on it would be written as $\mathbf U \mathbf x$. If now $\mathbf X$ is a collection of column vectors stacked together, then I can write $\mathbf U\mathbf X$, which is convenient. If, instead, $\mathbf X$ has samples in rows, as you advocate, then I would need to write $\mathbf X \mathbf U^\top$, which looks weird. But I admit that many textbooks use this convention (I am not sure why). – amoeba Oct 16 '14 at 14:44
  • Regarding biplots: additional source of confusion here is that "biplot", as originally introduced (see http://en.wikipedia.org/wiki/Biplot), is supposed to decompose $X$ into $X=USV^\top$, and then superimpose scatter plots of $US^\alpha$ and $S^{1-\alpha}V$. Superimposing principal components (which have $S$ in them) with loadings (which also have $S$ in them), makes it, according to this definition, *not a biplot*. I rarely (in fact almost never) use biplots, so I don't have a good intuition about whether such a "pseudo-biplot" is actually useful or rather not... What is your take on that? – amoeba Oct 16 '14 at 14:54
  • This is, of course, a matter of taste. Mind, however, that the great majority of statistical programs show data spreadsheets as `cases X variables`. By tradition, then, linear algebra in most statistical analysis texts makes a case a row vector. Maybe in machine learning it is different? – ttnphns Oct 16 '14 at 15:01
  • Regarding your comment on biplots. Right you are. But there should be no confusion or indecision. The formulae you write tell us: eigenvectors (both left and right) are standardized or coagulated coordinates of rows and columns on biplot. We can spread inertia (magnitude of eigenvalues) over them however we like. Give it full to rows -> get "raw component scores" (to borrow term from PCA); Give it full to columns -> get what is called "variable loadings" in PCA. – ttnphns Oct 16 '14 at 15:14
  • (cont.) Or may spread inertia equally -> get biplot characteristic for correspondence analysis. Or give in full to both U and V vectors ("spread twice")... These are all different _normalizations_ of biplot: so, there exist many layouts of biplot. – ttnphns Oct 16 '14 at 15:16
  • PCA, Correspondence analysis are simply particular and special forms (with some data normalizations) of svd-based biplot. The main laws about coordinates are: `rows: XV = US` and `cols: X'U = VS'`. That corresponds to giving inertia fully to both types of eigenvectors. Given that we may give them any portion of inertia between 0 (no) and 1 (full), the formulas look more generally as this: `rows: XV(S'^(p1-1)) = U(S^p1)` and `cols: X'U(S(^p2-1)) = V(S'^p2)`, where p1 and p2 are both [0,1]. If you require p1+p2=1 ("spreading" the inertia) then we arrive at the formulas you cite. – ttnphns Oct 16 '14 at 15:33
  • BTW, $U$ in the description of plots should probably be $V$. Arrows are rows of $V$ not of $U$. Judging from the comments, original answer had variables in rows, which might be the reason for the confusion. – VitoshKa Jun 04 '16 at 23:24
  • @VitoshKa: Yes, exactly. Thanks! Fixed. Let me know if you still see any problems. – amoeba Jun 04 '16 at 23:38
  • @amoeba I believe that in PCA, as in EFA, there is the problem of *factor rotation*. So, wouldn't it be more accurate to say that PCA loadings are $L=V\frac{S}{\sqrt{N-1}}R$ and PCA std scores are $\sqrt{N-1}UR$, where $R$ is an orthogonal matrix? – user_anon Sep 20 '18 at 12:31
  • @user_anon No, this answer considers standard PCA, without any factor rotations. – amoeba Sep 20 '18 at 20:37