516

Principal component analysis (PCA) is usually explained via an eigen-decomposition of the covariance matrix. However, it can also be performed via singular value decomposition (SVD) of the data matrix $\mathbf X$. How does it work? What is the connection between these two approaches? What is the relationship between SVD and PCA?

In other words, how can one use the SVD of the data matrix to perform dimensionality reduction?

kjetil b halvorsen
amoeba
  • 10
    I wrote this FAQ-style question together with my own answer, because it is frequently being asked in various forms, but there is no canonical thread and so closing duplicates is difficult. Please provide meta comments in [this accompanying meta thread](http://meta.stats.stackexchange.com/questions/2370). – amoeba Jan 22 '15 at 11:25
  • 2
    http://stats.stackexchange.com/questions/177102/what-is-the-intuition-behind-svd/179042#179042 – kjetil b halvorsen Feb 03 '16 at 10:45
  • 2
    In addition to an excellent and detailed amoeba's answer with its further links I might recommend to check [this](http://stats.stackexchange.com/q/141754/3277), where PCA is considered side by side some other SVD-based techniques. The discussion there presents algebra almost identical to amoeba's with just minor difference that the speech there, in describing PCA, goes about svd decomposition of $\mathbf X/\sqrt{n}$ [or $\mathbf X/\sqrt{n-1}$] instead of $\bf X$ - which is simply convenient as it relates to the PCA done via the eigendecomposition of the covariance matrix. – ttnphns Feb 03 '16 at 12:18
  • PCA is a special case of SVD. PCA needs the data normalized, ideally same unit. The matrix is nxn in PCA. – Orvar Korvar Oct 17 '17 at 09:12
  • 1
    @OrvarKorvar: What n x n matrix are you talking about ? – Cbhihe Mar 29 '18 at 15:16
  • @Cbhihe the `n x n` matrix is the covariance matrix of the data matrix `X` – YellowPillow Dec 05 '18 at 00:12

3 Answers

620

Let the data matrix $\mathbf X$ be of $n \times p$ size, where $n$ is the number of samples and $p$ is the number of variables. Let us assume that it is centered, i.e. column means have been subtracted and are now equal to zero.

Then the $p \times p$ covariance matrix $\mathbf C$ is given by $\mathbf C = \mathbf X^\top \mathbf X/(n-1)$. It is a symmetric matrix and so it can be diagonalized: $$\mathbf C = \mathbf V \mathbf L \mathbf V^\top,$$ where $\mathbf V$ is a matrix of eigenvectors (each column is an eigenvector) and $\mathbf L$ is a diagonal matrix with eigenvalues $\lambda_i$ in decreasing order on the diagonal. The eigenvectors are called principal axes or principal directions of the data. Projections of the data on the principal axes are called principal components, also known as PC scores; these can be seen as new, transformed variables. The $j$-th principal component is given by the $j$-th column of $\mathbf {XV}$. The coordinates of the $i$-th data point in the new PC space are given by the $i$-th row of $\mathbf{XV}$.

If we now perform singular value decomposition of $\mathbf X$, we obtain a decomposition $$\mathbf X = \mathbf U \mathbf S \mathbf V^\top,$$ where $\mathbf U$ is a unitary matrix (its columns are the left singular vectors) and $\mathbf S$ is the diagonal matrix of singular values $s_i$. From here one can easily see that $$\mathbf C = \mathbf V \mathbf S \mathbf U^\top \mathbf U \mathbf S \mathbf V^\top /(n-1) = \mathbf V \frac{\mathbf S^2}{n-1}\mathbf V^\top$$ (using $\mathbf U^\top \mathbf U = \mathbf I$ and $\mathbf S^\top = \mathbf S$, since $\mathbf S$ is diagonal), meaning that the right singular vectors $\mathbf V$ are principal directions and that the singular values are related to the eigenvalues of the covariance matrix via $\lambda_i = s_i^2/(n-1)$. Principal components are given by $\mathbf X \mathbf V = \mathbf U \mathbf S \mathbf V^\top \mathbf V = \mathbf U \mathbf S$.
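As a quick numerical check of this equivalence, here is a minimal NumPy sketch (random data for illustration only; the eigendecomposition and the SVD may differ by column signs, so comparisons are made up to sign):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                                # center the columns

C = X.T @ X / (n - 1)                              # covariance matrix
evals, evecs = np.linalg.eigh(C)                   # eigh: C is symmetric
order = np.argsort(evals)[::-1]                    # decreasing eigenvalue order
evals, evecs = evals[order], evecs[:, order]

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD

assert np.allclose(np.abs(Vt.T), np.abs(evecs))      # right singular vectors = principal directions
assert np.allclose(s**2 / (n - 1), evals)            # lambda_i = s_i^2 / (n - 1)
assert np.allclose(np.abs(X @ evecs), np.abs(U * s)) # XV = US (PC scores)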

To summarize:

  1. If $\mathbf X = \mathbf U \mathbf S \mathbf V^\top$, then columns of $\mathbf V$ are principal directions/axes.
  2. Columns of $\mathbf {US}$ are principal components ("scores").
  3. Singular values are related to the eigenvalues of covariance matrix via $\lambda_i = s_i^2/(n-1)$. Eigenvalues $\lambda_i$ show variances of the respective PCs.
  4. Standardized scores are given by columns of $\sqrt{n-1}\mathbf U$ and loadings are given by columns of $\mathbf V \mathbf S/\sqrt{n-1}$. See e.g. here and here for why "loadings" should not be confused with principal directions.
  5. The above is correct only if $\mathbf X$ is centered. Only then is covariance matrix equal to $\mathbf X^\top \mathbf X/(n-1)$.
  6. The above is correct only for $\mathbf X$ having samples in rows and variables in columns. If variables are in rows and samples in columns, then $\mathbf U$ and $\mathbf V$ exchange interpretations.
  7. If one wants to perform PCA on a correlation matrix (instead of a covariance matrix), then columns of $\mathbf X$ should not only be centered, but standardized as well, i.e. divided by their standard deviations.
  8. To reduce the dimensionality of the data from $p$ to $k<p$, select the first $k$ columns of $\mathbf U$ and the $k\times k$ upper-left part of $\mathbf S$. Their product $\mathbf U_k \mathbf S_k$ is the required $n \times k$ matrix containing the first $k$ PCs.
  9. Further multiplying the first $k$ PCs by the corresponding principal axes $\mathbf V_k^\top$ yields the $\mathbf X_k = \mathbf U_k^\vphantom \top \mathbf S_k^\vphantom \top \mathbf V_k^\top$ matrix that has the original $n \times p$ size but is of lower rank (of rank $k$). This matrix $\mathbf X_k$ provides a reconstruction of the original data from the first $k$ PCs. It has the lowest possible reconstruction error, see my answer here. A short NumPy sketch illustrating points 4, 8, and 9 is given right after this list.
  10. Strictly speaking, $\mathbf U$ is of $n\times n$ size and $\mathbf V$ is of $p \times p$ size. However, if $n>p$ then the last $n-p$ columns of $\mathbf U$ are arbitrary (and the corresponding rows of $\mathbf S$ are constant zero); one should therefore use an economy size (or thin) SVD that returns $\mathbf U$ of $n\times p$ size, dropping the useless columns. For large $n\gg p$ the matrix $\mathbf U$ would otherwise be unnecessarily huge. The same applies to the opposite situation of $n\ll p$.
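Here is the promised minimal NumPy sketch of points 4, 8, and 9, using made-up random data (the variable names are only for illustration):

import numpy as np

rng = np.random.default_rng(1)
n, p, k = 50, 5, 2
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                                # point 5: X must be centered

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD (point 10)
S = np.diag(s)

# point 4: standardized scores and loadings
standardized_scores = np.sqrt(n - 1) * U
loadings = Vt.T @ S / np.sqrt(n - 1)
assert np.allclose(standardized_scores.std(axis=0, ddof=1), 1.0)

# point 8: keep only the first k PCs -> n x k matrix of scores
scores_k = U[:, :k] @ S[:k, :k]

# point 9: rank-k reconstruction of the original n x p data matrix
X_k = scores_k @ Vt[:k, :]
assert X_k.shape == (n, p)
assert np.linalg.matrix_rank(X_k) == k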

Further links

Rotating PCA animation

amoeba
  • 1
    +1 for both Q&A. Thanks for sharing. I have one question: why do you have to assume that the data matrix is centered initially? – Antoine Aug 06 '15 at 08:39
  • after reading your summary point #5, I restate my question as: why is the covariance matrix equal to this quantity only if the data matrix is centered initially? – Antoine Aug 06 '15 at 08:58
  • 12
    @Antoine, covariance matrix is by definition equal to $\langle (\mathbf x_i - \bar{\mathbf x})(\mathbf x_i - \bar{\mathbf x})^\top \rangle$, where angle brackets denote average value. If all $\mathbf x_i$ are stacked as rows in one matrix $\mathbf X$, then this expression is equal to $(\mathbf X - \bar{\mathbf X})^\top(\mathbf X - \bar{\mathbf X})/(n-1)$. If $\mathbf X$ is centered then it simplifies to $\mathbf X^\top \mathbf X/(n-1)$. Think of variance; it's equal to $\langle (x_i-\bar x)^2 \rangle$. But if $\bar x=0$ (i.e. data are centered), then it's simply the average value of $x_i^2$. – amoeba Aug 06 '15 at 09:43
  • @amoeba Thanks a lot for your prompt and helpful answer. Regarding your summary point #4: Could you please elaborate on where the $\sqrt{n-1} U$ comes from when we standardize the scores? To me, the PCs, or scores, are defined as either the columns of $XV_{(n,p)}$ (PCA approach) or the columns of $US_{(n,p)}$ (SVD approach), that's it. – Antoine Aug 06 '15 at 09:48
  • @Antoine, yes, PC scores are given by the columns of $XV=US$. But if you want to standardize them, you want each column to have variance 1. In $US$, matrix $S$ is diagonal and so simply scales the columns of $U$ by various factors. To standardize, we don't care about scaling, so we can look at $U$ alone. But columns of $U$ have unit norm (because $U$ is an orthogonal matrix). We don't want unit norm, we want unit variance, and variance is given by the sum of squares divided by $(n-1)$. So the variance of columns of $U$ is $1/(n-1)$. To make it equal to 1, take $\sqrt{n-1}U$ instead. – amoeba Aug 06 '15 at 10:20
  • 1
    @amoeba for those less familiar with linear algebra and matrix operations, it might be nice to mention that $(A.B.C)^{T}=C^{T}.B^{T}.A^{T}$ and that $U^{T}.U=Id$ because $U$ is orthogonal – Antoine Feb 03 '16 at 15:20
  • 2
    A code sample for PCA by SVD: http://stackoverflow.com/questions/3181593/matlab-is-running-out-of-memory-but-it-should-not-be/3181851#3181851 – optimist Mar 23 '16 at 11:51
  • Hi @amoeba, great answer here. Super helpful. Question on your point 9 - what are the applications / uses / purposes of these lower rank reconstructed data? Thanks. – WillZ May 22 '16 at 00:06
  • @WillZhang: Thanks, I am glad it's helpful. What are the applications/uses of low-rank data reconstruction? For example, denoising. If your data are pictures, you can try to de-noise them by discarding low-variance PCs. You can search this website for `PCA reconstruct` and you will see that this comes up a lot. – amoeba May 22 '16 at 22:21
  • @amoeba Makes perfect sense, thank you for explaining. – WillZ May 23 '16 at 07:24
  • @amoeba where can I find a book to learn SVD, PCA and other linear algebra-statistics algorithms with detailed and simple explanations? Are there any workbooks or courses with exercises? – Marat Zakirov Jun 14 '16 at 10:29
  • @amoeba, I really love the way you explain things. Can you show me a few lines of proof of why mapping the data into the space of eigenvectors maximizes the variance? Or give me a related link. Thanks! – Haitao Du Jun 17 '16 at 19:16
  • @hxd1011: In fact, I have posted an answer to a very similar question just today, please see here: http://stats.stackexchange.com/questions/217995. – amoeba Jun 17 '16 at 20:10
  • Thanks so much for this post. Other than "left singular vectors", what is the conceptual term / meaning of $U$? I see from point 4 that columns of $\sqrt{n-1}U$ are the standardized scores, and columns of $US$ are the (not standardized) scores (point 2). So then is it correct to say that columns of $U$ are the samples in the new coordinate system, just as the scores are the samples in the new coordinate system? And then would we say that $U$ vs. $US$ is simply one of scaling the coordinate system? – Bryan Hanson Jul 21 '16 at 21:39
  • Yes, I think that's correct, @BryanHanson. One can say that columns of $U$ are scores that are normalized to have sum of squares equal to one. – amoeba Jul 21 '16 at 23:09
  • @amoeba why is it $n-1$ in the denominator? should it be just $n$? – Dims Sep 05 '17 at 21:35
  • @Dims See https://en.wikipedia.org/wiki/Bessel%27s_correction. – amoeba Sep 05 '17 at 21:40
  • 2
    @amoeba yes, but why use it? Also, is it possible to use the same denominator for $S$? The problem is that I see formulas where $\lambda_i = s_i^2$ and try to understand, how to use them? – Dims Sep 05 '17 at 22:49
  • @amoeba if in the X matrix the rows are the features and the columns are the samples, then the covariance is X Xtrans, right? and in this case, the mean centering is done across the rows. – seralouk Mar 02 '18 at 19:13
  • @sera Yes, that's correct. – amoeba Mar 02 '18 at 22:18
  • @amoeba thank you. one more question. If in the X matrix the rows are the features and the columns are the samples, then could you tell me how I can perform PCA using SVD? Based on the PCA, the Xtransformed would be np.dot(Vtransposed, X) and then how can I connect this to the SVD decomposition? – seralouk Mar 03 '18 at 23:10
  • 1
    @sera Just transpose your matrix and get rid of your problem. You will only be getting confused otherwise. – amoeba Mar 04 '18 at 14:45
  • @amoeba thank you. eventually I managed to find the relationship also in this case, which is Vtransposed_pca X = Sigma Vtransposed_svd and the pca eigenvectors V are in the U matrix in this case. – seralouk Mar 04 '18 at 19:55
  • thanks for the answer. I've browsed many websites and your answer is the clearest one. In prcomp() function, it produces a result called rotation. Is this rotation the eigenvector, $V$? – HappyCoding Jun 19 '18 at 03:13
  • Great answer. Thanks. Regarding summary point #10, I think the last $n-p$ columns are not generally zero in SVD. Are you talking about a special case? – Ron Jul 04 '18 at 03:46
  • What a fantastic answer, thanks a lot! – fbartolic Apr 15 '20 at 23:01
  • when saying $X$ is `centered`, it's the mean of each row (or column) of $X$ having been removed, or the mean of all $X$ elements is removed from $X$? – ddzzbbwwmm Sep 07 '21 at 15:12
  • Since $S$ is not a symmetric matrix, so $X^T$ should be $(USV^T)^T = VS^TU^T$ and $\mathbf C = \mathbf V \mathbf S^T \mathbf U^\top \mathbf U \mathbf S \mathbf V^\top /(n-1) = \mathbf V \frac{\mathbf S^2}{n-1}\mathbf V^\top$, right? I don't know why you treat $S$ the same as $S^T$. – Belter Jan 01 '22 at 14:16
34

I wrote a Python & Numpy snippet that accompanies @amoeba's answer and I leave it here in case it is useful for someone. The comments are mostly taken from @amoeba's answer.

import numpy as np
from numpy import linalg as la
np.random.seed(42)


def flip_signs(A, B):
    """
    utility function for resolving the sign ambiguity in SVD
    http://stats.stackexchange.com/q/34396/115202
    """
    signs = np.sign(A) * np.sign(B)
    return A, B * signs


# Let the data matrix X be of n x p size,
# where n is the number of samples and p is the number of variables
n, p = 5, 3
X = np.random.rand(n, p)
# Let us assume that it is centered
X -= np.mean(X, axis=0)

# the p x p covariance matrix
C = np.cov(X, rowvar=False)
print("C = \n", C)
# C is a symmetric matrix and so it can be diagonalized:
l, principal_axes = la.eig(C)
# sort results wrt. eigenvalues
idx = l.argsort()[::-1]
l, principal_axes = l[idx], principal_axes[:, idx]
# the eigenvalues in decreasing order
print("l = \n", l)
# a matrix of eigenvectors (each column is an eigenvector)
print("V = \n", principal_axes)
# projections of X on the principal axes are called principal components
principal_components = X.dot(principal_axes)
print("Y = \n", principal_components)

# we now perform singular value decomposition of X
# "economy size" (or "thin") SVD
U, s, Vt = la.svd(X, full_matrices=False)
V = Vt.T
S = np.diag(s)

# 1) then columns of V are principal directions/axes.
assert np.allclose(*flip_signs(V, principal_axes))

# 2) columns of US are principal components
assert np.allclose(*flip_signs(U.dot(S), principal_components))

# 3) singular values are related to the eigenvalues of covariance matrix
assert np.allclose((s ** 2) / (n - 1), l)

# 8) dimensionality reduction
k = 2
PC_k = principal_components[:, 0:k]
US_k = U[:, 0:k].dot(S[0:k, 0:k])
assert np.allclose(*flip_signs(PC_k, US_k))

# 10) we used "economy size" (or "thin") SVD
assert U.shape == (n, p)
assert S.shape == (p, p)
assert V.shape == (p, p)
amoeba
turdus-merula
29

Let me start with PCA. Suppose that you have $n$ data points, each consisting of $d$ numbers (or dimensions). If you center these data (subtract the mean data point $\mu$ from each data vector $x_i$), you can stack them to make a matrix

$$ X = \left( \begin{array}{ccccc} && x_1^T - \mu^T && \\ \hline && x_2^T - \mu^T && \\ \hline && \vdots && \\ \hline && x_n^T - \mu^T && \end{array} \right)\,. $$

The covariance matrix

$$ S = \frac{1}{n-1} \sum_{i=1}^n (x_i-\mu)(x_i-\mu)^T = \frac{1}{n-1} X^T X $$

measures the degree to which the different coordinates of your data vary together. So it's maybe not surprising that PCA -- which is designed to capture the variation of your data -- can be given in terms of the covariance matrix. In particular, the eigenvalue decomposition of $S$ turns out to be

$$ S = V \Lambda V^T = \sum_{i = 1}^r \lambda_i v_i v_i^T \,, $$

where $v_i$ is the $i$-th Principal Component, or PC, and $\lambda_i$ is the $i$-th eigenvalue of $S$ and is also equal to the variance of the data along the $i$-th PC. This decomposition comes from a general theorem in linear algebra, and some work does have to be done to motivate the relation to PCA.

[Figure: PCA of a randomly generated Gaussian dataset]

SVD is a general way to understand a matrix in terms of its column-space and row-space. (It's a way to rewrite any matrix in terms of other matrices with an intuitive relation to the row and column space.) For example, for the matrix $A = \left( \begin{array}{cc}1&2\\0&1\end{array} \right)$ we can find directions $u_i$ and $v_i$ in the domain and range so that $A v_i = \sigma_i u_i$:

[Figure: SVD for a 2x2 example]

You can find these by considering how $A$ as a linear transformation morphs a unit sphere $\mathbb S$ in its domain to an ellipse: the principal semi-axes of the ellipse align with the $u_i$ and the $v_i$ are their preimages.
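As a quick numerical check of this picture for the $2\times 2$ example above, here is a minimal NumPy sketch:

import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0]])
U, s, Vt = np.linalg.svd(A)

# A sends each right singular direction v_i to sigma_i * u_i,
# i.e. the v_i are the preimages of the ellipse's principal semi-axes
for i in range(2):
    assert np.allclose(A @ Vt[i], s[i] * U[:, i])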

In any case, for the data matrix $X$ above (really, just set $A = X$), SVD lets us write

$$ X = \sum_{i=1}^r \sigma_i u_i v_i^T\,, $$

where $\{ u_i \}$ and $\{ v_i \}$ are orthonormal sets of vectors. A comparison with the eigenvalue decomposition of $S$ reveals that the "right singular vectors" $v_i$ are equal to the PCs, the "left singular vectors" are

$$ u_i = \frac{1}{\sqrt{(n-1)\lambda_i}} Xv_i\,, $$

and the "singular values" $\sigma_i$ are related to the data matrix via

$$ \sigma_i^2 = (n-1) \lambda_i\,. $$

It's a general fact that the left singular vectors $u_i$ span the column space of $X$. In this specific case, the $u_i$ give us a scaled projection of the data $X$ onto the direction of the $i$-th principal component. The right singular vectors $v_i$ in general span the row space of $X$, which gives us a set of orthonormal vectors that spans the data much like the PCs.
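A brief NumPy sketch checking these identities on a randomly generated, centered data matrix (the two factorizations may differ by column signs, so the comparison is made up to sign):

import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 4
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                              # center the data

S_cov = X.T @ X / (n - 1)                        # the covariance matrix S
lam, V = np.linalg.eigh(S_cov)
lam, V = lam[::-1], V[:, ::-1]                   # decreasing eigenvalue order

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# sigma_i^2 = (n - 1) * lambda_i
assert np.allclose(sigma**2, (n - 1) * lam)
# u_i = X v_i / sqrt((n - 1) * lambda_i), up to sign
assert np.allclose(np.abs(U), np.abs(X @ V / np.sqrt((n - 1) * lam)))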

I go into some more details and benefits of the relationship between PCA and SVD in this longer article.

Andre P
  • Thanks for your answer Andre. Just two small typo corrections: 1. In the last paragraph you're confusing left and right. 2. In the (capital) formula for X, you're using v_j instead of v_i. – Alon Sep 03 '19 at 13:09