28

This question is about an efficient way to compute principal components.

  1. Many texts on linear PCA advocate using singular-value decomposition of the casewise data. That is, if we have data $\bf X$ and want to replace the variables (its columns) by principal components, we do SVD: $\bf X=USV'$, with the singular values (square roots of the eigenvalues) occupying the main diagonal of $\bf S$; the right eigenvectors $\bf V$ form the orthogonal rotation matrix from axes-variables to axes-components, and the left eigenvectors $\bf U$ are like $\bf V$, only for cases. We can then compute the component values as $\bf C=XV=US$.

  2. Another way to do PCA of variables is via decomposition of the square matrix $\bf R=X'X$ (i.e. $\bf R$ can be the correlation or covariance matrix, etc., of the variables). The decomposition may be an eigen-decomposition or a singular-value decomposition: for a square symmetric positive semidefinite matrix they give the same result, $\bf R=VLV'$, with the eigenvalues on the diagonal of $\bf L$ and $\bf V$ as described earlier. Component values will be $\bf C=XV$. (A small sketch of both ways follows right after this list.)
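
To make the two recipes concrete, here is a minimal Matlab sketch of both ways on a small random matrix (the variable names are mine, for illustration only); the component scores agree up to column signs:

X = randn(200, 5);
X = X - repmat(mean(X), 200, 1);     % column-center the data

% way (1): SVD of the data matrix
[U, S, V1] = svd(X, 'econ');
C1 = U * S;                          % component scores, equal to X*V1

% way (2): eigendecomposition of X'X
[V2, L] = eig(X' * X);
[~, idx] = sort(diag(L), 'descend'); % order components by eigenvalue
V2 = V2(:, idx);
C2 = X * V2;

max(max(abs(abs(C1) - abs(C2))))     % ~1e-13: same components up to sign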

Now, my question: if the data $\bf X$ is a big matrix, and the number of cases is (as is often the case) much greater than the number of variables, then way (1) is expected to be much slower than way (2), because way (1) applies a quite expensive algorithm (such as SVD) to a big matrix; it computes and stores the huge matrix $\bf U$, which we really don't need in our case (the PCA of variables). If so, then why do so many textbooks seem to advocate or just mention only way (1)? Maybe it is efficient and I'm missing something?

ttnphns
  • Generally we are interested in only the few principal components which explain most of the variance. It is possible to do a reduced SVD; for example, if $X$ is of dimension $N \times p$ where $p \ll N$, then `R`'s `svd` function will compute only the first $p$ left and right singular vectors by default. – M. Berk Dec 09 '13 at 12:21
  • @M.Berk: $p$, however, is the same in both approaches: they yield equivalent results (equal up to sign changes). Also, e.g., R calculates $\mathbf C$ only if asked for. – cbeleites unhappy with SX Dec 09 '13 at 15:35
  • Do you have any references for way (1)? I'm only aware of PCA being implemented via SVD on the covariance matrix (i.e. way 2), as this avoids some numerical problems and obviously scales with the dimensionality, not the data set size. Way (1) I would call SVD, not PCA at all. I've only seen it in a pure SVD context, where one would not actually perform a complete decomposition. – Has QUIT--Anony-Mousse Jan 01 '14 at 04:17
  • @Anony-Mousse, Just to mention one: `Joliffe, Principal component analysis, 2nd ed.` Actually, Joliffe describes both ways, but in the core chapter on PCA he talks only about way 1, as far as I can remember. – ttnphns Jan 01 '14 at 08:01
  • @Anony-Mousse, Way 1 is important to me from a theoretical point of view, because it clearly shows how PCA is directly related to simple _correspondence analysis_. – ttnphns Jan 01 '14 at 08:07
  • I agree that it is interesting from a theoretical point of view. It may well be that (1) has the nicer formal properties, while (2) has the better computational efficiency. – Has QUIT--Anony-Mousse Jan 01 '14 at 12:53

2 Answers

22

SVD is slower but is often considered to be the preferred method because of its higher numerical accuracy.

As you state in the question, principal component analysis (PCA) can be carried out either by SVD of the centered data matrix $\mathbf X$ (see this Q&A thread for more details) or by the eigen-decomposition of the covariance matrix $\frac{1}{n-1}\mathbf X^\top \mathbf X$ (or, alternatively, $\mathbf{XX}^\top$ if $n\ll p$, see here for more details).

Here is what is written in MATLAB's `pca()` function help:

Principal component algorithm that pca uses to perform the principal component analysis [...]:

'svd' -- Default. Singular value decomposition (SVD) of X.

'eig' -- Eigenvalue decomposition (EIG) of the covariance matrix. The EIG algorithm is faster than SVD when the number of observations, $n$, exceeds the number of variables, $p$, but is less accurate because the condition number of the covariance is the square of the condition number of X.

The last sentence highlights the crucial speed-accuracy trade-off that is at play here.
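
For reference, the algorithm can be selected explicitly; a minimal illustrative call (assuming the Statistics and Machine Learning Toolbox, which provides pca()):

X = randn(1000, 100);                    % tall data matrix, n >> p
coeff_svd = pca(X, 'Algorithm', 'svd');  % default: slower but more accurate
coeff_eig = pca(X, 'Algorithm', 'eig');  % faster when n >> p, per the help text above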

You are right to observe that eigendecomposition of the covariance matrix is usually faster than SVD of the data matrix. Here is a short benchmark in Matlab with a random $1000\times 100$ data matrix:

X = randn([1000 100]);

tic; svd(X); toc         %// Elapsed time is 0.004075 seconds.
tic; svd(X'); toc        %// Elapsed time is 0.011194 seconds.
tic; eig(X'*X); toc      %// Elapsed time is 0.001620 seconds.
tic; eig(X*X'); toc;     %// Elapsed time is 0.126723 seconds.

The fastest way in this case is via the covariance matrix (third row). Of course if $n \ll p$ (instead of vice versa) then it will be the slowest way, but in that case using the Gram matrix $\mathbf{XX}^\top$ (fourth row) will be the fastest way instead. SVD of the data matrix itself will be slower either way.
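
For completeness, here is a minimal sketch (with made-up dimensions, not part of the benchmark above) of how the PCA of the variables can be recovered from the eigendecomposition of the Gram matrix when $n \ll p$:

n = 20; p = 1000; k = 10;            % made-up sizes; keep the leading k components
X = randn(n, p);
X = X - repmat(mean(X), n, 1);       % column-center

[U, L] = eig(X * X');                % n x n Gram matrix
[l, idx] = sort(diag(L), 'descend');
U = U(:, idx(1:k));

s = sqrt(l(1:k));                    % singular values of X
scores   = U * diag(s);              % component scores (n x k)
loadings = X' * U * diag(1 ./ s);    % eigenvectors of X'X, i.e. the V above (p x k)

max(max(abs(X * loadings - scores))) % ~1e-12: scores equal X times loadings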

However, SVD of the data matrix will be more accurate, because multiplying $\mathbf X$ by itself can lead to a loss of numerical accuracy. Here is an example, adapted from @J.M.'s answer to "Why SVD on $X$ is preferred to eigendecomposition of $XX^\top$ in PCA" on Math.SE.

Consider a data matrix $$\mathbf X = \begin{pmatrix}1&1&1\\\epsilon & 0 & 0\\ 0 & \epsilon & 0 \\ 0 & 0 & \epsilon\end{pmatrix},$$ sometimes called Läuchli matrix (and let us omit centering for this example). Its squared singular values are $3+\epsilon^2$, $\epsilon^2$, and $\epsilon^2$. Taking $\epsilon = 10^{-5}$, we can use SVD and EIG to compute these values:

eps = 1e-5;
X = [1 1 1; eye(3)*eps];
display(['Squared sing. values of X: ' num2str(sort(svd(X),'descend').^2')])
display(['Eigenvalues of X''*X:       ' num2str(sort(eig(X'*X),'descend')')])

obtaining identical results:

Squared sing. values of X: 3       1e-10       1e-10
Eigenvalues of X'*X:       3       1e-10       1e-10

But taking now $\epsilon = 10^{-10}$ we can observe how SVD still performs well but EIG breaks down:

Squared sing. values of X: 3       1e-20       1e-20
Eigenvalues of X'*X:       3           0 -3.3307e-16

What happens here is that the very computation of the covariance matrix squares the condition number of $\mathbf X$. So, especially when $\mathbf X$ has some nearly collinear columns (i.e. some very small singular values), first computing the covariance matrix and then computing its eigendecomposition will result in a loss of precision compared to direct SVD.
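
The squaring of the condition number can be checked directly; a quick illustrative snippet with the same Läuchli-type matrix:

eps = 1e-5;
X = [1 1 1; eye(3)*eps];
cond(X)        % = sqrt(3+eps^2)/eps, about 1.7e5
cond(X'*X)     % = cond(X)^2,         about 3.0e10
% with eps = 1e-10, X'*X becomes exactly singular in double precision
% (1 + eps^2 rounds to 1), which is why EIG breaks down above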

I should add that one is often happy to ignore this potential [tiny] loss of precision and rather use the faster method.

amoeba
  • Also, for the SVD you can write (or just use) the algorithm for $\mathbf X^T$, which has the opposite runtime behaviour (because the decomposition is symmetric wrt. transposing $\mathbf X$) and otherwise the same numeric behaviour. Numeric stability is an important point (+1) - though I guess it depends a lot on the data whether that matters. E.g., I typically have far too few cases and noisy measurements - so the stability of my models is typically not limited by the numeric stability of the underlying SVDs but by the patients/cell-culture batches or instrumental noise. – cbeleites unhappy with SX Feb 23 '14 at 11:48
  • Thanks for the answer and for thorough consideration of pros and cons. – ttnphns Feb 23 '14 at 18:09
  • amoeba, might you find the time to show a concrete example where numerical stability suffers with the `eig()` approach? (Readers will benefit: there is a trade-off between speed and stability. How might one decide in a concrete practical situation?) – ttnphns Jun 24 '16 at 11:18
  • @ttnphns I rewrote the whole answer, providing a concrete example. Take a look. – amoeba Jun 30 '16 at 13:52
  • @amoeba, thank you very much for coming back and giving an example! I tried both epsilon examples in SPSS and got results like yours, except for the very last line: instead of `3 0 -3.3307e-16`, eigen in SPSS returned `3 0 0`. It looks as if the function has some built-in and fixed tolerance value beyond which it zeroes things out. In this example, the function appeared, as it were, to hack the knot of numerical instability by zeroing both tiny eigenvalues, the "0" and the "-3.3307e-16". – ttnphns Jun 30 '16 at 16:09
  • I personally tend to think that the "numerical instability/inaccuracy" danger is only of minor importance. Eigenvalues below, say, 1e-13 can be safely considered as just zero - for the majority, if not all, of practical statistical analysis tasks. – ttnphns Jun 30 '16 at 16:19
  • @ttnphns: Yes, rounding to zero (that SPSS might be doing) makes sense. Still, getting 0 for the last two eigenvalues is less accurate than getting `1e-20` which is the correct result and can be obtained via svd (as far as I understood your comment in SPSS as well). In general, I fully agree that in practice all of that is of minor importance. Whenever the difference in speed becomes substantial, I prefer to use the faster method. – amoeba Jun 30 '16 at 16:21
  • Here's the thing: you wouldn't really know whether you'd be safe with the symmetric eigendecomposition or SVD unless you actually perform the computation (a so-called "wicked problem"). So, why not stick with the safe instead of gambling with the potentially unruly? (Regarding speed versus stability: I'd personally prioritize the latter over the former, unless someone is going to die unless those matrices just have to be decomposed RIGHT NOW AND FAST!!!1! ;)) – J. M. is not a statistician Jun 30 '16 at 23:12
  • @J.M., it's a matter of taste. As for me, I strongly vote for speed, not accuracy. A practicing statistician mostly works with medium to large datasets; in addition, he may need to do a matrix decomposition many times in loops. Time is crucial. To me, it is "silly" to do linear OLS regression via svd (very accurate, but slow); packages would typically use a linear-equation-system solver, or the sweep-operation approach. – ttnphns Jul 01 '16 at 02:07
  • @ttnphns, sure, for most "garden-variety" (whatever that means) LS problems, "sweep" or the normal equations will do fine. But if you're angling for diagnostics (unless you also can't be arsed to check your results), or you have a sufficiently complicated basis set that you can't easily tell if some of those are in fact not independent, you'll use SVD. And honestly, people could stand to be in less of a hurry... :P – J. M. is not a statistician Jul 01 '16 at 03:14
7

Here are my 2ct on the topic

  • The chemometrics lecture where I first learned PCA used solution (2), but it was not numerically oriented, and my numerics lecture was only an introduction and didn't discuss SVD as far as I recall.

  • If I understand Holmes: Fast SVD for Large-Scale Matrices correctly, your idea has been used to get a computationally fast SVD of long matrices.
    That would mean that a good SVD implementation may internally follow (2) if it encounters suitable matrices (I don't know whether there are still better possibilities). This would mean that for a high-level implementation it is better to use the SVD (1) and leave it to the BLAS to take care of which algorithm to use internally.

  • Quick practical check: OpenBLAS's svd doesn't seem to make this distinction: on a matrix of 5e4 x 100, `svd (X, nu = 0)` takes a median of 3.5 s, while `svd (crossprod (X), nu = 0)` takes 54 ms (called from R with `microbenchmark`).
    Taking square roots of the eigenvalues to recover the singular values is of course fast, and up to that the results of both calls are equivalent.

    timing  <- microbenchmark (svd (X, nu = 0), svd (crossprod (X), nu = 0), times = 10)
    timing
    # Unit: milliseconds
    #                      expr        min         lq    median         uq        max neval
    #            svd(X, nu = 0) 3383.77710 3422.68455 3507.2597 3542.91083 3724.24130    10
    # svd(crossprod(X), nu = 0)   48.49297   50.16464   53.6881   56.28776   59.21218    10
    

update: Have a look at Wu, W.; Massart, D. & de Jong, S.: The kernel PCA algorithms for wide data. Part I: Theory and algorithms, Chemometrics and Intelligent Laboratory Systems, 36, 165-172 (1997). DOI: http://dx.doi.org/10.1016/S0169-7439(97)00010-5

This paper discusses numerical and computational properties of 4 different algorithms for PCA: SVD, eigen decomposition (EVD), NIPALS and POWER.

They are related as follows:

    computes on    extracts all PCs at once    sequential extraction
    X              SVD                         NIPALS
    X'X            EVD                         POWER
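
To illustrate what "sequential extraction" means, here is a minimal NIPALS-style sketch (my own illustration, not taken from the paper) that extracts the leading components one at a time from $\mathbf X$ by deflation; it is written in Matlab to match the paper's setting, and the names and sizes are made up:

    X = randn(30, 500);                       % wide data, as in the paper's context
    X = X - repmat(mean(X), 30, 1);           % column-center
    ncomp = 3;
    T = zeros(30, ncomp); P = zeros(500, ncomp);
    for a = 1:ncomp
        t = X(:, 1);                          % initial score vector
        for it = 1:200
            p = X' * t / (t' * t);            % loading for the current component
            p = p / norm(p);
            tnew = X * p;                     % updated score vector
            if norm(tnew - t) < 1e-10 * norm(tnew), t = tnew; break; end
            t = tnew;
        end
        T(:, a) = t; P(:, a) = p;             % scores and loadings of component a
        X = X - t * p';                       % deflate before extracting the next one
    end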

The context of the paper is wide matrices $\mathbf X^{(30 \times 500)}$, and they work on $\mathbf{XX'}$ (kernel PCA) - this is just the opposite of the situation you ask about. So to answer your question about long-matrix behaviour, you need to exchange the meanings of "kernel" and "classical".

[figure from the paper: performance comparison of the four algorithms]

Not surprisingly, EVD and SVD change places depending on whether the classical or kernel algorithms are used. In the context of this question this means that one or the other may be better depending on the shape of the matrix.

But from their discussion of "classical" SVD and EVD it is clear that decomposition of $\mathbf{X'X}$ is a very common way to calculate the PCA. However, they do not specify which SVD algorithm is used, other than that they use Matlab's `svd()` function.


    > sessionInfo ()
    R version 3.0.2 (2013-09-25)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8     LC_MONETARY=de_DE.UTF-8   
     [6] LC_MESSAGES=de_DE.UTF-8    LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
    [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
    [1] microbenchmark_1.3-0

    loaded via a namespace (and not attached):
    [1] tools_3.0.2

    $ dpkg --list libopenblas*
    [...]
    ii  libopenblas-base              0.1alpha2.2-3                 Optimized BLAS (linear algebra) library based on GotoBLAS2
    ii  libopenblas-dev               0.1alpha2.2-3                 Optimized BLAS (linear algebra) library based on GotoBLAS2
cbeleites unhappy with SX
  • So, your testing (3.5 sec vs 54 msec) supports my line that "way 1" is considerably slower. Right? – ttnphns Dec 09 '13 at 16:08
  • @ttnphns: yes. But as svd is provided by the BLAS, that could be different with a different BLAS. I'd have expected that a good optimized BLAS does something like this. It doesn't seem to be the case with OpenBLAS, however. I'm too lazy to check other BLASs, but maybe a few people could check their other BLASs so we find out which ones are optimized for this case and which aren't. (I emailed the OpenBLAS developer and sent him a link to this question, so maybe he can add some info, e.g. reasons for not switching the algorithm to `svd(X'X)` for long matrices.) – cbeleites unhappy with SX Dec 09 '13 at 18:34
  • Some points need clarification (to me). Can the "kernel methods" be summarized as "work on $X'$ instead of $X$ when $n < p$"? – Elvis Feb 23 '14 at 07:29
  • @Elvis: a) There's more to kernel methods than just computing on $\mathbf X \mathbf X^T$, see e.g. http://stats.stackexchange.com/questions/2499/explain-what-a-kernel-is-in-plain-english. The equivalence is trivial for PCA (it doesn't matter whether you start by getting right or left singular vectors) but not for other methods. b) "which way to do NIPALS" relies on the same overall principle. Which algorithm is used for the SVD depends on your BLAS, and actually my guess is that NIPALS is not involved here. Note that my timing includes the computation of the cross product. – cbeleites unhappy with SX Feb 23 '14 at 11:30
  • I was talking about your update, where Nipals is involved. I confirm Nipals is not involved in Lapack’s SVD. About your benchmark experiment, something like `microbenchmark(X – Elvis Feb 23 '14 at 14:57
  • ...and its difference with `X – Elvis Feb 23 '14 at 15:10