
I am studying PCA from Andrew Ng's Coursera course and other materials. In the Stanford NLP course cs224n's first assignment, and in the lecture video from Andrew Ng, they do singular value decomposition instead of eigendecomposition of the covariance matrix, and Ng even says that SVD is numerically more stable than eigendecomposition.

From my understanding, for PCA we should do SVD of the data matrix of size (m, n), not of the covariance matrix of size (n, n); it is the eigendecomposition that applies to the covariance matrix.

**Why do they do SVD of the covariance matrix, not the data matrix?**

DongukJu
  • For square symmetric positive semidefinite matrix (such as covariance matrix), eigenvalue and singular value decompositions are exactly the same. – amoeba Nov 16 '17 at 10:52
  • I mean they are *mathematically* the same. *Numerically* they might indeed use different algorithms and one might be more stable than another (as Ng says). This would be interesting to know more about, +1. – amoeba Nov 16 '17 at 11:18
  • Some information about this here: https://de.mathworks.com/matlabcentral/newsreader/view_thread/21268. But note that any explanation about why one algorithm would be more stable than another is going to be very technical. – amoeba Nov 16 '17 at 11:22
  • In Matlab `x=randn(10000); x=x'*x; tic; eig(x); toc; tic; svd(x); toc;` on my machine outputs 12s for eig() and 26s for svd(). If it's so much slower, it must at least be more stable! :-) – amoeba Nov 16 '17 at 11:37
  • That might be based on an incorrect understanding: doing an SVD of the data matrix *is* more stable than using `eig` or `svd` on the covariance matrix, but as far as I know there is no big difference between using `eig` or `svd` on the covariance matrix --- they are both backward stable algorithms. If anything, I would put my money on eig being *more* stable, since it does fewer computations (assuming both are implemented with state-of-the-art algorithms). – Federico Poloni Nov 16 '17 at 14:28
  • @FedericoPoloni Actually, I think A. Ng might be right. Apparently, the LAPACK implementation of the SVD uses a divide-and-conquer approach as described here: https://en.wikipedia.org/wiki/Divide-and-conquer_eigenvalue_algorithm as opposed to the conventional QR algorithm used by the eigendecomposition code. The former appears to be more stable, although I have not read a detailed account of why. – cangrejo Nov 16 '17 at 15:46
  • @FedericoPoloni I found a relevant paper. See my updated answer if you're interested. – cangrejo Nov 16 '17 at 16:24
  • TL;DR The answer is in your question: SVD is much more numerically stable than EIG – Vladislavs Dovgalecs Nov 16 '17 at 19:19
  • As @amoeba mentioned in comments, the question is not about comparing SVD on data to eigen decomposition on covariance matrix, it is about why svd was used on "covariance" matrix. – gunakkoc Nov 16 '17 at 19:22
  • @broncoAbierto Interesting -- I didn't know about these details, in particular that two different algorithms are used for these two tasks in Lapack currently. – Federico Poloni Nov 17 '17 at 10:55
  • Sometimes I get different decompositions. E.g. `C = [ 0.236553 -0.020460 0.029987;-0.020460 0.366393 0.018122;0.029987 0.018122 0.330042]` and in Octave, I get a reflection with `[Ve,D] = eig(C)` and a rotation with `[U,S,V] = svd(C)`. How do I deal with that when I need a rotation for visualizing the confidence ellipsoid? – Jacob Jun 08 '19 at 03:20

7 Answers


amoeba already gave a good answer in the comments, but if you want a formal argument, here it is.

The singular value decomposition of a matrix $A$ is $A=U\Sigma V^T$, where the columns of $V$ are eigenvectors of $A^TA$ and the diagonal entries of $\Sigma$ are the square roots of its eigenvalues, i.e. $\sigma_{ii}=\sqrt{\lambda_i(A^TA)}$.

As you know, the principal components are the orthogonal projections of your data onto the eigenvectors of the empirical covariance matrix $\frac{1}{n-1}A^TA$ (assuming the columns of $A$ are centered). The variances of the components are given by its eigenvalues, $\lambda_i(\frac{1}{n-1}A^TA)$.

Consider any square matrix $B$, $\alpha \in \mathbb R$ and a vector $v$ such that $Bv=\lambda v$. Then

  1. $B^kv=\lambda^kv$
  2. $\lambda(\alpha B) = \alpha\lambda( B)$

Let us define $S=\frac{1}{n-1}A^TA$. The SVD of $S$ will compute the eigendecomposition of $S^TS=\frac{1}{(n-1)^2}A^TAA^TA$ to yield

  1. the eigenvectors of $(A^TA)^TA^TA=A^TAA^TA$, which by property 1 are those of $A^TA$
  2. the square roots of the eigenvalues of $\frac{1}{(n-1)^2}A^TAA^TA$, which by property 2, then 1, then 2 again, are $\sqrt{\frac{1}{(n-1)^2} \lambda_i(A^TAA^TA)} = \sqrt{\frac{1}{(n-1)^2} \lambda_i^2(A^TA)} = \frac{1}{n-1}\lambda_i(A^TA) = \lambda_i(\frac{1}{n-1}A^TA)$.

Voilà!
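
A quick numerical sanity check of the above (a minimal sketch in numpy; the matrix size and tolerances are arbitrary choices of mine):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 6))
A = A - A.mean(axis=0)                     # center the columns
S = A.T @ A / (A.shape[0] - 1)             # empirical covariance matrix

U, s, Vt = np.linalg.svd(S)                # SVD of the covariance matrix
lam, V = np.linalg.eigh(S)                 # eigendecomposition (eigh, since S is symmetric)
lam, V = lam[::-1], V[:, ::-1]             # eigh sorts ascending; flip to descending

print(np.allclose(s, lam))                 # singular values of S equal its eigenvalues
print(np.allclose(np.abs(Vt @ V), np.eye(6), atol=1e-8))   # same eigenvectors, up to sign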

Regarding numerical stability, one would need to figure out which algorithms are employed. If you're up to it, I believe these are the LAPACK routines used by numpy: `dgesdd` for the SVD and `dgeev` for the eigendecomposition.

Update: On the stability, the SVD implementation seems to be using a divide-and-conquer approach, while the eigendecomposition uses a plain QR algorithm. I cannot access some relevant SIAM papers from my institution (blame research cutbacks) but I found something that might support the assessment that the SVD routine is more stable.

In

Nakatsukasa, Yuji, and Nicholas J. Higham. "Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD." SIAM Journal on Scientific Computing 35.3 (2013): A1325-A1349.

they compare the stability of various eigenvalue algorithms, and it seems that the divide-and-conquer approach (they use the same one as numpy in one of the experiments!) is more stable than the QR algorithm. This, together with claims elsewhere that D&C methods are indeed more stable, supports Ng's choice.
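
If you want to experiment with this yourself, scipy exposes both LAPACK SVD drivers, so you can compare them directly on a covariance matrix. Here is a minimal sketch (the test matrix and the error measure are my own choices; `gesdd` is the divide-and-conquer driver numpy uses by default, `gesvd` the classic bidiagonal-QR one):

import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 50))
C = A.T @ A / (A.shape[0] - 1)               # a covariance-like test matrix

for driver in ("gesdd", "gesvd"):            # divide-and-conquer vs. classic bidiagonal QR
    U, s, Vt = svd(C, lapack_driver=driver)
    # backward error of the computed factorization, in units of machine epsilon
    err = np.linalg.norm(U @ np.diag(s) @ Vt - C) / (np.linalg.norm(C) * np.finfo(float).eps)
    print(driver, err)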

cangrejo
  • The eigenvalues I obtained from svd on covariance and svd on mean centered data are not the same. – gunakkoc Nov 16 '17 at 19:28
  • However, the scores, that is X*V (where V is obtained from [U,S,V] = svd(x) or svd(covx) ), are same. – gunakkoc Nov 16 '17 at 19:30
  • @theGD Eigenvalues of cov(X) and singular values of (X) are not identical, see https://stats.stackexchange.com/questions/134282. – amoeba Nov 16 '17 at 19:41
  • no need to despair w.r.t. lack of access to SIAM journals: the paper you cite is here: http://www.opt.mist.i.u-tokyo.ac.jp/~nakatsukasa/publishedpdf/pub13.pdf – Dima Pasechnik Nov 17 '17 at 00:49
  • @theGD the eigenvalues from svd of covariance mtx must be square of svd of data matrix, isn't it? From my test for 2x2 matrix, it does. – DongukJu Nov 17 '17 at 01:06
  • @DimaPasechnik Thanks, but I could access that one. The one I was interested in but couldn't view is Gu, Ming, and Stanley C. Eisenstat. "A divide-and-conquer algorithm for the symmetric tridiagonal eigenproblem." SIAM Journal on Matrix Analysis and Applications 16.1 (1995): 172-191. I'm sure they discuss stability in that one. – cangrejo Nov 17 '17 at 09:53
  • @broncoAbierto the tech. report is here: https://cpsc.yale.edu/sites/default/files/files/tr932.pdf (one probably cannot easily find it due to a typo "Symetric" in the title on https://cpsc.yale.edu/research/technical-reports/1992-technical-reports :-)) – Dima Pasechnik Nov 17 '17 at 10:10

@amoeba has already given excellent answers to PCA questions, including this one on the relation of SVD to PCA. To answer your exact question, I'll make three points:

  • mathematically there is no difference whether you calculate PCA on the data matrix directly or on its covariance matrix
  • the difference is purely due to numerical precision and complexity; applying SVD directly to the data matrix is numerically more stable than applying it to the covariance matrix
  • SVD can be applied to the covariance matrix to perform PCA or obtain eigenvalues; in fact, it's my favorite method of solving eigenproblems

It turns out that SVD is more stable than typical eigenvalue decomposition procedures, especially for machine learning. In machine learning it is easy to end up with highly collinear regressors, and SVD works better in these cases.

Here's Python code to demo the point. I created a highly collinear data matrix, computed its covariance matrix and tried to obtain the eigenvalues of the latter. SVD still works, while the ordinary eigendecomposition fails in this case.

import numpy as np
from numpy import linalg as LA

np.random.seed(1)

# create the highly collinear series
T = 1000
X = np.random.rand(T,2)
eps = 1e-11
X[:,1] = X[:,0] + eps*X[:,1]

C = np.cov(np.transpose(X))
print('Cov: ',C)

U, s, V = LA.svd(C)
print('SVDs: ',s)

w, v = LA.eig(C)
print('eigen vals: ',w)

Output:

Cov:  [[ 0.08311516  0.08311516]
 [ 0.08311516  0.08311516]]
SVDs:  [  1.66230312e-01   5.66687522e-18]
eigen vals:  [ 0.          0.16623031]

Update

In answer to Federico Poloni's comment, here is code that tests the stability of SVD vs Eig on 1000 random samples of the same matrix as above. In many cases Eig reports the small eigenvalue as exactly 0, which would imply a singular matrix, while SVD does not do so here. SVD is about twice as precise at determining the small eigenvalue, which may or may not be important depending on your problem.

import numpy as np
from numpy import linalg as LA

np.random.seed(1)

# create the highly collinear series
T = 100
p = 2
eps = 1e-8

m = 1000 # simulations
err = np.ones((m,2)) # accuracy of small eig value
for j in range(m):
    u = np.random.rand(T,p)
    X = np.ones(u.shape)
    X[:,0] = u[:,0]
    for i in range(1,p):
        X[:,i] = eps*u[:,i]+u[:,0]

    C = np.cov(np.transpose(X))

    U, s, V = LA.svd(C)

    w, v = LA.eig(C)

    # true eigen values
    te = eps**2/2 * np.var(u[:,1])*(1-np.corrcoef(u,rowvar=False)[0,1]**2)
    err[j,0] = s[p-1] - te
    err[j,1] = np.amin(w) - te


print('Cov: ',C)
print('SVDs: ',s)
print('eigen vals: ',w)
print('true small eigenvals: ',te)

acc = np.mean(np.abs(err),axis=0)    
print("small eigenval, accuracy SVD, Eig: ",acc[0]/te,acc[1]/te)

Output:

Cov:  [[ 0.09189421  0.09189421]
 [ 0.09189421  0.09189421]]
SVDs:  [ 0.18378843  0.        ]
eigen vals:  [  1.38777878e-17   1.83788428e-01]
true small eigenvals:  4.02633695086e-18
small eigenval, accuracy SVD, Eig:  2.43114702041 3.31970128319

Here's how the code works. Instead of generating a random covariance matrix to test the routines, I'm generating a random data matrix with two variables: $$x_1=u\\ x_2=u+\varepsilon v$$ where $u,v$ are independent uniform random variables. So, the covariance matrix is $$\begin{pmatrix} \sigma_1^2 & \sigma_1^2 + \varepsilon \rho \sigma_1 \sigma_2\\ \sigma_1^2 + \varepsilon \rho \sigma_1 \sigma_2 & \sigma_1^2 + 2 \varepsilon \rho \sigma_1 \sigma_2 + \varepsilon^2 \sigma_2^2\end{pmatrix}$$ where $\sigma_1^2,\sigma_2^2$ are the variances of the uniforms and $\rho$ is the correlation coefficient between them.

Its smallest eigenvalue is: $$\lambda= \frac 1 2 \left(\sigma_2^2 \varepsilon^2 - \sqrt{\sigma_2^4 \varepsilon^4 + 4 \sigma_2^3 \rho \sigma_1 \varepsilon^3 + 8 \sigma_2^2 \rho^2 \sigma_1^2 \varepsilon^2 + 8 \sigma_2 \rho \sigma_1^3 \varepsilon + 4 \sigma_1^4} + 2 \sigma_2 \rho \sigma_1 \varepsilon + 2 \sigma_1^2\right)$$ The small eigenvalue can't be calculated by simply plugging $\varepsilon$ into the formula due to limited precision, so you need to Taylor expand it: $$\lambda\approx \sigma_2^2 \varepsilon^2 (1-\rho^2)/2$$

I run $j=1,\dots,m$ simulations of realizations of the data matrix, calculate the eigenvalues of each simulated covariance matrix, $\hat\lambda_j$, and obtain the errors $e_j=\lambda-\hat\lambda_j$.
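
If you want to double-check that Taylor expansion of the small eigenvalue, here is a short symbolic verification (a sketch using sympy; the symbol names are mine, and the symbols are declared positive only to keep the output simple):

import sympy as sp

# sigma_1, sigma_2: standard deviations of u and v; rho: their correlation; e: epsilon
s1, s2, rho, e = sp.symbols('sigma_1 sigma_2 rho varepsilon', positive=True)

# covariance matrix of (x1, x2) = (u, u + e*v), as written above
C = sp.Matrix([[s1**2,               s1**2 + e*rho*s1*s2],
               [s1**2 + e*rho*s1*s2, s1**2 + 2*e*rho*s1*s2 + e**2*s2**2]])

# expand each eigenvalue around e = 0 and drop the O(e**3) remainder;
# the smaller one should reduce to sigma_2**2*(1 - rho**2)*e**2/2
for lam in C.eigenvals():
    print(sp.simplify(sp.series(lam, e, 0, 3).removeO()))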

Aksakal
  • Yes, but here OP is asking about SVD vs EIG applied *both* to the covariance matrix. – amoeba Nov 16 '17 at 18:24
  • @amoeba, I clarified the relation of SVD and PCA – Aksakal Nov 16 '17 at 18:56
  • This is a good answer. I wish to mention, however, that svd can't detect negative eigenvalues when there are any and you want to see them (if the covariance matrix is not original but is, say, smoothed or estimated somehow or inferred or comes out of pairwise deletion of missing values). Moreover, eig on the cov matrix remains a little bit faster than svd on it. – ttnphns Nov 16 '17 at 22:03
  • @ttnphns, non positive definite matrix is an issue, of course – Aksakal Nov 16 '17 at 22:27
  • Why do you think the eigendecomposition "fails"? Both results seems to be accurate within machine precision * norm(C). – Federico Poloni Nov 17 '17 at 10:44
  • Also, I don't think one 2x2 example is sufficient. For instance, if you simply change that `np.random.seed(1)` to `np.random.seed(2)`, then SVD returns `9.81307787e-18` as second eigenvalue and the eigendecomposition returns `1.38777878e-17`, which seems to be a lot more accurate (it coincides with the result computed with Sagemath and 300 significant digits). – Federico Poloni Nov 17 '17 at 10:51
  • That code doesn't run --- there is an `esp` typo. – Federico Poloni Nov 18 '17 at 09:26
  • Also, I don't see why those should be the true eigenvalues. – Federico Poloni Nov 18 '17 at 09:28
  • And there is a more fundamental issue. Even if those were the true eigenvalues of the covariance matrix $C$, the matrix you pass to `eig` or `svd` is not the *exact* matrix $C$, but a computed approximation to it, which might as well be singular for all we know: damage has already been done in the matrix product. I think the only meaningful way to assess accuracy is comparing the eigenvalues of C with the same eigenvalues computed with many more significant digits. If you have access to Sagemath, you can do it with `RR = RealField(prec=300); CC = matrix(C).change_ring(RR); CC.eigenvalues()`. – Federico Poloni Nov 18 '17 at 09:32
  • I modified the code to calculate the exact eigen value – Aksakal Nov 18 '17 at 17:04
  • I'm not sure what that formula is, but it is evaluated in the same floating point arithmetic that is used for svd and eig, so it will have errors, too. You should compare against eigenvalues computed with higher accuracy, otherwise you will never know what the true eigenvalues of C are. – Federico Poloni Nov 18 '17 at 17:38
  • And another problem is that both `svd` and `eig` probably take ad-hoc code paths for 2x2 problems (using formulas that will not be much different from your "exact" one). Their performance on a 2x2 problem may not reflect that obtained on a larger matrix. – Federico Poloni Nov 18 '17 at 17:39
  • @FedericoPoloni, on FP arithmetic and not knowing the exact answer I disagree. In this case I do know the answer with enough precision for this task. On 2x2 you have a fair point. I'll think of something. – Aksakal Nov 19 '17 at 01:22

For Python users, I'd like to point out that for symmetric matrices (like the covariance matrix), it is better to use the numpy.linalg.eigh function instead of the general numpy.linalg.eig function.

eigh is 9-10 times faster than eig on my computer (regardless of matrix size) and has better accuracy (based on @Aksakal's accuracy test).

I am not convinced with the demonstration of the accuracy benefit of SVD with small eigenvalues. @Aksakal's test is 1-2 orders of magnitude more sensitive to random state than to the algorithm (try plotting all errors instead of reducing them to one absolute maximum). It means that small errors in the covariance matrix will have a greater effect on accuracy than the choice of an eigendecomposition algorithm. Also, this is not related to the main question, which is about PCA. The smallest components are ignored in PCA.

A similar argument can be made about numerical stability. If I have to use the covariance matrix method for PCA, I would decompose it with eigh instead of svd. If it fails (which has not been demonstrated here yet), then it is probably worth rethinking the problem that you are trying to solve before starting to look for a better algorithm.
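
If you want to check the speed claim on your own machine, here is a minimal timing sketch (the matrix size is arbitrary and the exact ratios will depend on your BLAS/LAPACK build):

import time
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.standard_normal((5000, n))
C = np.cov(X, rowvar=False)                  # symmetric n x n covariance matrix

for solver in (np.linalg.eig, np.linalg.eigh, np.linalg.svd):
    t0 = time.perf_counter()
    solver(C)                                # decompose the same matrix with each routine
    print(f"{solver.__name__}: {time.perf_counter() - t0:.2f} s")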

Mosalx
  • +1. Some info on `eigh` vs `eig`: https://mail.scipy.org/pipermail/numpy-discussion/2006-March/019618.html – amoeba Nov 23 '17 at 14:59

To answer the last part of your question, "Why do they do SVD of covariance matrix, not data matrix?" I believe it is for performance and storage reasons. Typically, $m$ will be a very large number and even if $n$ is large, we would expect $m \gg n$.

Calculating the covariance matrix and then performing SVD on that is vastly quicker than calculating SVD on the full data matrix under these conditions, for the same result.

Even for fairly small values the performance gains are factors of thousands (milliseconds vs seconds). I ran a few tests on my machine to compare using Matlab:

[Figure: Matlab timings comparing SVD of the full data matrix with forming the covariance matrix and taking its SVD, for several matrix sizes]

That's just CPU time, but storage needs are just as, if not more, important. If you attempt SVD on a million by a thousand matrix in Matlab it will error by default, because it needs a working array size of 7.4TB.
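
The same comparison is easy to reproduce in Python (a sketch with sizes chosen by me; note that it uses the thin SVD of the data matrix, so the gap is smaller than with the full SVD that Matlab computes by default):

import time
import numpy as np

rng = np.random.default_rng(0)
m, n = 100_000, 300                          # m >> n
X = rng.standard_normal((m, n))

t0 = time.perf_counter()
np.linalg.svd(X, full_matrices=False)        # thin SVD of the (m, n) data matrix
t_data = time.perf_counter() - t0

t0 = time.perf_counter()
C = np.cov(X, rowvar=False)                  # (n, n) covariance matrix
np.linalg.svd(C)                             # SVD of the covariance matrix
t_cov = time.perf_counter() - t0

print(f"thin SVD of data matrix:        {t_data:.2f} s")
print(f"cov + SVD of covariance matrix: {t_cov:.2f} s")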

Gruff
  • This does not answer the question which is about EIG of the cov matrix vs. SVD *of the covariance matrix*. – amoeba Feb 16 '18 at 08:06
  • His question at the end, highlighted in bold, states, "Why do they do SVD of covariance matrix, not data matrix?" which I answered. – Gruff Feb 16 '18 at 11:38
  • I'll edit the opening sentence to make it clear I was answering that part of the OP's question. I see how that could be confusing. Thanks. – Gruff Feb 17 '18 at 06:02
  • *If you attempt SVD on a million by a thousand matrix in Matlab it will error by default* Good numerical practice is using the thin SVD, in these cases. This will vastly improve storage size and performance. – Federico Poloni Oct 15 '18 at 11:50

Some great answers have already been given to your question, so I won't add much that is new. But I tried (i) to base my answer on the knowledge you seem to have and (ii) to be as concise as possible, so you - or others in a similar situation - may find this answer helpful.

(Simple) Mathematical Explanation

SVD and the eigendecomposition are closely related. Let $X \in \mathbb{R}^{n \times p}$ be a real, column-centered data matrix; then you may define its covariance matrix $C \in \mathbb{R}^{p\times p}$ as \begin{equation} C = \frac{1}{n} X^T X. \end{equation}

1 | SVD of X

As you correctly stated, applying SVD to $X$ decomposes your original data as \begin{equation*} X = U S V^T \end{equation*} with $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{p \times p}$ unitary; the columns of $U$ are the left singular vectors (which give the principal components once scaled by $S$) and the columns of $V$ are the eigenvectors of $X^TX$. The diagonal matrix $S \in \mathbb{R}^{n \times p}$ holds the singular values $s$.

2 | Eigendecomposition of C

Since $C$ is Hermitian, its eigendecomposition yields eigenvectors given by the unitary matrix $V$ with corresponding real eigenvalues $\lambda$ as entries of a diagonal matrix $\Lambda \in \mathbb{R}^{p \times p}$: \begin{equation} CV = V\Lambda \end{equation} In this case we may calculate the principal components by projecting the original data onto the eigenvectors, PCs $= X V$. Note that the $i$-th PC has variance $\lambda_i$, i.e. the PCs are scaled by their corresponding eigenvalues rather than normalized.

3 | SVD of C

In order to answer your questions, recall that we can factorize $C$ - since it is symmetric - via \begin{equation} C = V\Lambda V^T \end{equation} using its eigenvectors and eigenvalues.

Note that this also follows by simply rearranging the equation from section (2). We can therefore obtain the eigenvectors of $C$ - and hence the principal directions of $X$ - by applying the SVD to $C$.

With just a bit more of effort we can now establish the relation between the singular values and the eigenvalues. Using the definition of the covariance we may as well write: \begin{align} C &= \frac{1}{n} ( U S V^T )^T ( U S V^T ) \\ &= \frac{1}{n} ( V S U^T U S V^T ) \\ &= \frac{1}{n} ( V S^2 V^T ) \end{align} The last equation holds since $U$ is unitary, that is $U^T U = \mathbb{1}$. Now by simply comparing this result with that from above we find: \begin{equation} \frac{1}{n} ( V S^2 V^T ) = V \Lambda V^T \quad \Rightarrow \quad \lambda = \frac{s^2}{n} \end{equation}
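
A quick numerical check of the relation $\lambda = s^2/n$ (a sketch in numpy; the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)                       # center the columns
C = X.T @ X / n                              # covariance matrix as defined above

s = np.linalg.svd(X, compute_uv=False)       # singular values of X
lam = np.linalg.eigvalsh(C)[::-1]            # eigenvalues of C, descending

print(np.allclose(lam, s**2 / n))            # lambda_i == s_i**2 / n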


Python, NumPy and Algorithms

Just a basic example to explore the different behaviour of numpy's

  • linalg.svd(X)
  • linalg.svd(C)
  • linalg.eig(C)
  • linalg.eigh(C)

shows that linalg.eig() reveals some (at least for me) unexpected behaviour. Computing the product $V^TV$ (which should equal $\mathbb{1}$) for all four cases gives a visual idea of the respective precision. It seems from the figure below that linalg.eig() only provides a stable solution up to dimension $d = \text{rank}(C) = \text{min}(n,p)$.

[Figure: comparison of the four approaches - heatmaps of $V^TV$ for svd(X), svd(C), eig(C) and eigh(C)]


import numpy as np
import matplotlib.pyplot as plt

# Create random data
n,p = [100,300]
X = np.random.randn(n,p)

# Covariance matrix
C = X.T @ X / n


# Create figure environment
fig = plt.figure(figsize=(14,5))
ax1 = fig.add_subplot(141)
ax2 = fig.add_subplot(142)
ax3 = fig.add_subplot(143)
ax4 = fig.add_subplot(144)

# 1. SVD on X
# ---------------------
U,s,VT = np.linalg.svd(X)
V = VT.T
ax1.imshow(V.T@V,cmap='Reds',vmin=0,vmax=1)

# 2. SVD on C
# ---------------------
V,eigenvalues,VT = np.linalg.svd(C)
ax2.imshow(V.T@V,cmap='Reds',vmin=0,vmax=1)
# 3. Eigendecomposition on C
# -> linalg.eig()
# ---------------------
eigenvalues,V = np.linalg.eig(C)
sortIdx = np.argsort(eigenvalues)[::-1]
V = V[:,sortIdx]
ax3.imshow((V.T@V).real,cmap='Reds',vmin=0,vmax=1)

# 4. Eigendecomposition on C
# -> linalg.eigh()
# ---------------------
eigenvalues,V = np.linalg.eigh(C)
sortIdx = np.argsort(eigenvalues)[::-1]
V = V[:,sortIdx]
ax4.imshow((V.T@V).real,cmap='Reds',vmin=0,vmax=1)

for a in [ax1,ax2,ax3,ax4]:
    a.set_xticks([])
    a.set_yticks([])
ax1.set_title('svd(X)')
ax2.set_title('svd(C)')
ax3.set_title('eig(C)')
ax4.set_title('eigh(C)')
fig.subplots_adjust(wspace=0,top=0.95)  
fig.suptitle('Eigendecomposition in NumPy')
plt.show()
nicrie

If you apply SVD to the covariance matrix, the principal vectors you get are the same as those from applying SVD to the data matrix, so mathematically the two are equivalent in this case.

However, in terms of complexity it does not make much sense to apply SVD to the covariance matrix: you have already paid to construct the covariance matrix, and then you pay for an SVD, which is more expensive than computing the eigenvectors of a symmetric matrix.

The best practice is to apply SVD directly to the data matrix, to save some flops (compared to Andrew Ng's way) and to gain the numerical stability of SVD routines (compared to eigendecomposition).

Hamed

If anyone cares about the performance difference in a specific case: I compared SVD and eigen-analysis on a 632x632 real symmetric matrix L using C++/Eigen. I used BDCSVD

// compute full V, U *not* needed
Eigen::BDCSVD<Eigen::MatrixXf> svd(L,Eigen::ComputeFullV);

and SelfAdjointEigenSolver

// eigendecomposition of the real symmetric matrix L (N is the problem size, used to preallocate)
Eigen::SelfAdjointEigenSolver<Eigen::MatrixXf> eigenSolver(N);
eigenSolver.compute(L);

and ran each block of code 100 times (clang++ -O3 -DNDEBUG) on a 3.2 GHz Intel Core i5 Mac. These are the results:

 BDCSVD ..................... 10036 milliseconds
 SelfAdjointEigenSolver ..... 10338 milliseconds

So each takes about 100 msec for a single call, making it a push speed-wise.

wcochran