21

I understand the relation between Principal Component Analysis and Singular Value Decomposition at an algebraic/exact level. My question is about the scikit-learn implementation.

The documentation says: "[TruncatedSVD] is very similar to PCA, but operates on sample vectors directly, instead of on a covariance matrix.", which would reflect the algebraic difference between the two approaches. However, it later says: "This estimator [TruncatedSVD] supports two algorithms: a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X), whichever is more efficient.". Regarding PCA, it says: "Linear dimensionality reduction using Singular Value Decomposition of the data to project it ...". And the PCA implementation supports the same two solvers (randomized and ARPACK) plus another one, LAPACK. Looking into the code I can see that both the ARPACK and LAPACK solvers, in both PCA and TruncatedSVD, perform an SVD on the sample data X, with ARPACK being able to deal with sparse matrices (using svds).

So, aside from different attributes and methods, and the fact that PCA can additionally do an exact full singular value decomposition using LAPACK, the PCA and TruncatedSVD scikit-learn implementations seem to be exactly the same algorithm. First question: is this correct?

Second question: even though LAPACK and ARPACK use scipy.linalg.svd(X) and scipy.sparse.linalg.svds(X), X being the sample matrix, they compute the singular value decomposition or eigen-decomposition of $X^T X$ or $X X^T$ internally, while the "randomized" solver doesn't need to compute that product. (This is relevant in connection with numerical stability, see Why PCA of data by means of SVD of the data?). Is this correct?
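
To illustrate why this matters numerically, here is a small sketch (a Läuchli-style example in plain NumPy, not tied to the scikit-learn code; the value of `eps` is arbitrary) of what can be lost when $X^T X$ is formed explicitly in floating point:

```python
import numpy as np

eps = 1e-8
# Läuchli-style matrix: exact singular values are sqrt(3 + eps**2), eps, eps
X = np.array([[1.0, 1.0, 1.0],
              [eps, 0.0, 0.0],
              [0.0, eps, 0.0],
              [0.0, 0.0, eps]])

# SVD of X itself recovers the tiny singular values
print(np.linalg.svd(X, compute_uv=False))      # [1.732..., 1e-08, 1e-08]

# Forming X^T X squares the condition number: eps**2 = 1e-16 is lost
# next to 1.0 in double precision, so the small singular values vanish
evals = np.linalg.eigvalsh(X.T @ X)
print(np.sqrt(np.clip(evals, 0.0, None)))      # the 1e-08 values are gone
```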

Relevant code: PCA line 415. TruncatedSVD line 137.

drake
  • could you add a link to the code? – seanv507 Oct 10 '16 at 21:49
  • drake - I think I agree with you on the first question. I don't understand the second one. What do you mean by "they compute the singular value decomposition or eigen-decomposition of $X^T X$ or $X X^T$ internally"? You have just shown the code where it is all done using SVD on X. The numerical issues refer to first computing the covariance matrix (call it C) and then finding the eigenvectors of C. – seanv507 Oct 11 '16 at 00:24
  • @seanv507 Regarding the 2nd question -- I guess that scipy.linalg.svd(X) computes the SVD by doing the eigen-decomposition of $X^T X$ and/or $X X^T$. Same thing for linalg.svds(X). Quoting: "a fast randomized SVD solver, and a “naive” algorithm that uses ARPACK as an eigensolver on (X * X.T) or (X.T * X)". See also the last line in http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.svds.html . The only way I can understand the first quote is that the randomized algorithm is the only one that doesn't compute the covariance/gram matrix. – drake Oct 11 '16 at 00:49
  • I would guess the ARPACK approach has to do with something like [Arnoldi iteration](https://en.wikipedia.org/wiki/Arnoldi_iteration), so it only has to do matrix-vector products. (In principle these sorts of iterative methods do not even need an explicit $X$, just a pair of routines `Xtimes()` and `Xt_times()`. This is common for large sparse matrices in PDE solvers, for example.) – GeoMatt22 Oct 11 '16 at 01:01
  • @GeoMatt22 Can you elaborate on your comment? Do you mean that ARPACK or LAPACK approaches don't suffer from numerical instabilities because they don't need to compute the covariance matrix? – drake Oct 11 '16 at 01:10
  • @drake I do not know the details of these packages. I am just vaguely familiar with these iterative methods that only require matrix-vector products. In cases like [PDE](https://en.wikipedia.org/wiki/Laplacian_matrix#Example_of_the_Operator_on_a_Grid)s or [PageRank](https://en.wikipedia.org/wiki/Google_matrix) the goal is usually saving computation time, as the matrix is huge but very sparse. – GeoMatt22 Oct 11 '16 at 01:19
  • Drake, your scipy quote shows that svds(sparse) is a 'naive' algorithm calculating $X X^T$. But scipy.linalg.svd uses dgesdd, which doesn't ( https://groups.google.com/forum/m/#!topic/julia-dev/mmgO65i6-fA ) – seanv507 Oct 11 '16 at 06:21

2 Answers

21

> PCA and TruncatedSVD scikit-learn implementations seem to be exactly the same algorithm.

No: PCA is (truncated) SVD on centered data (after per-feature mean subtraction). If the data are already centered, the two classes will do the same thing.

In practice TruncatedSVD is useful on large sparse datasets which cannot be centered without making the memory usage explode.
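
A quick numerical sanity check of this (a minimal sketch: the random dense data, the number of components and the choice of the ARPACK solver are just for illustration, and `singular_values_` assumes a scikit-learn recent enough to expose that attribute on both classes):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.randn(200, 30)
X_centered = X - X.mean(axis=0)   # per-feature mean subtraction

pca = PCA(n_components=5, svd_solver="arpack").fit(X_centered)
tsvd = TruncatedSVD(n_components=5, algorithm="arpack").fit(X_centered)

# On already-centered data the two estimators agree: same singular values,
# same principal axes (up to a sign flip of each component)
print(np.allclose(pca.singular_values_, tsvd.singular_values_))
print(np.allclose(np.abs(pca.components_), np.abs(tsvd.components_)))
```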

  • numpy.linalg.svd and scipy.linalg.svd both rely on LAPACK _GESDD described here: http://www.netlib.org/lapack/lug/node32.html (divide and conquer driver)

  • scipy.sparse.linalg.svds relies on ARPACK to do an eigenvalue decomposition of $X^T X$ or $X X^T$ (depending on the shape of the data) via the Arnoldi iteration method. The HTML user guide of ARPACK has broken formatting which hides the computational details, but the Arnoldi iteration is well described on Wikipedia: https://en.wikipedia.org/wiki/Arnoldi_iteration
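
For reference, here is roughly how the two scipy routines are called (an illustrative sketch only; the matrix size and density are arbitrary):

```python
import numpy as np
from scipy.linalg import svd              # LAPACK _GESDD: dense, full SVD
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds      # ARPACK: truncated SVD, k largest

A = sparse_random(1000, 200, density=0.01, random_state=0)

# ARPACK works on the sparse matrix directly (it only needs matrix-vector
# products) and returns the k largest singular triplets, with the singular
# values in ascending order
u, s, vt = svds(A, k=5)

# LAPACK needs a dense array and computes all min(m, n) singular values
s_full = svd(A.toarray(), compute_uv=False)

print(np.allclose(np.sort(s), np.sort(s_full)[-5:]))   # True
```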

Here is the code for the ARPACK-based SVD in scipy:

https://github.com/scipy/scipy/blob/master/scipy/sparse/linalg/eigen/arpack/arpack.py#L1642 (search for the string "def svds" in case the line number has changed in the source code).

ogrisel
  • Thanks! I forgot to mention that PCA centers the data matrix -- I saw that in the code -- and that my question was about a column-wise centered X. What I don't understand, then, is why there are two classes with different names (PCA and TruncatedSVD) when the only difference is that PCA centers X. Wouldn't it make more sense for scaling/centering to be just an option/hyperparameter within a single class? Those two classes don't reflect the algebraic difference between PCA and SVD -- the former does an eigenvalue decomposition of the covariance matrix and the latter a singular value decomposition of X. – drake Oct 11 '16 at 17:40
  • In my opinion it is confusing for someone who is learning PCA and SVD to have these two classes. – drake Oct 11 '16 at 17:42
  • One can support sparse data efficiently (TruncatedSVD), the other cannot (PCA). This is why we have 2 classes. – ogrisel Oct 12 '16 at 14:43
  • If that's the reason, then I'd call them SVD and SparseSVD (or similar) to avoid confusion. – drake Oct 12 '16 at 22:37
  • But people want PCA and they might not know that PCA is just SVD on centered data. – ogrisel Oct 13 '16 at 07:51
  • I see... The point is that even though the *exact* result is the same if the data matrix has been centered, the procedures are different (PCA uses the covariance matrix and SVD uses the data matrix), which gives rise to numerical differences (PCA is numerically unstable, see http://stats.stackexchange.com/a/87536/53653). This isn't reflected in the scikit-learn implementations, as far as I can see. I understand your point, though. Thanks. – drake Oct 13 '16 at 17:35
  • Note that the PCA class (in scikit-learn 0.18 and later) has all the solvers of the TruncatedSVD class in addition to the traditional LAPACK solver. By default, on large datasets with a small value for n_components, the fast but approximate randomized SVD is used, as it is the only tractable solver on medium-sized datasets (e.g. 10000 samples x 1000 features with n_components=30). – ogrisel Oct 14 '16 at 09:42
  • I saw that, as I said in the question. And in my opinion, that's a reason for unifying them in the same class. Anyway, that's just a suggestion. My doubts are now clarified. – drake Oct 14 '16 at 18:07
  • But on sparse data we cannot physically center the data (the resulting dense array would be too large to fit in memory, or the SVD would be too slow to compute on the dense array). Therefore we cannot compute the true PCA model on sparse data. Naming a truncated SVD on uncentered data "PCA" would be incorrect. This is why we decided to introduce the TruncatedSVD class: to compute a truncated SVD when computing a true PCA is not possible. – ogrisel Oct 18 '16 at 09:13
  • @ogrisel (wouldn't it make more sense to refactor to use common code?) An intermediate position might be to threshold the means, e.g. subtract the mean based on mean/standard deviation and the desired sparseness. – seanv507 Oct 18 '16 at 10:36
  • It's already using a lot of common code. Removing the column-wise means of a sparse matrix would turn its zeros into non-zeros with high probability, i.e. make it dense. This is not possible. – ogrisel Oct 18 '16 at 11:48
  • @drake I disagree that "the procedures are different (PCA uses covariance matrix and SVD uses data matrix)". PCA is a name for the type of analysis. One can use different algorithms and implementations to carry it out. EIG of the cov matrix is one method, SVD of the centered data matrix is another method, and then EIG and SVD can be performed by various methods as well. It does not matter - all of that is PCA. – amoeba Oct 18 '16 at 12:10
  • @amoeba Thanks for the clarification/correction on the terminology. What you say makes more sense to me, given that SVD and EIG are algebraic theorems/methods with a wider scope than PCA. – drake Oct 18 '16 at 15:56

5

There is also a difference in how the attribute explained_variance_ is calculated.

Let the data matrix $\mathbf X$ be of $n \times p$ size, where $n$ is the number of samples and $p$ is the number of variables. And $\mathbf{X}_c$ is the centered data matrix, i.e. column means have been subtracted and are now equal to zero in this matrix. Assume that we are reducing the dimensionality of the data from $p$ to $k \lt p$.

Then for sklearn.decomposition.PCA we have the following expressions: $$\mathbf{X}_c \approx \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T \qquad (\text{truncated SVD of } \mathbf{X}_c);$$ $$\mathbf{L}_k = \frac{1}{n-1} \mathbf{S}^2_k \quad \Longleftrightarrow \quad \lambda_j = \frac{s_j^2}{n-1}, \quad \forall j =1,\ldots,k; \qquad(*)$$

where $\mathbf{L}_k = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ is the matrix of the $k$ largest eigenvalues of the covariance matrix $\mathbf{C} = \frac{1}{n-1} \mathbf{X}_c^T\mathbf{X}_c$, and $\mathbf{S}_k = \mathrm{diag}(s_1, \ldots, s_k)$ is the matrix of the $k$ largest singular values of $\mathbf{X}_c$. You can easily prove $(*)$ by substituting the truncated SVD of $\mathbf{X}_c$ into the expression for the covariance matrix $\mathbf{C}$ and comparing the result with the truncated eigendecomposition $\mathbf{C} \approx \mathbf{V}_k \mathbf{L}_k \mathbf{V}_k^T$ (here it was done for full decompositions), as sketched below. The matrix $\mathbf{L}_k$ is the explained_variance_ attribute in sklearn.decomposition.PCA.
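
A sketch of that substitution (using $\mathbf{U}_k^T \mathbf{U}_k = \mathbf{I}$): $$\mathbf{C} \approx \frac{1}{n-1}\left(\mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T\right)^T \left(\mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T\right) = \frac{1}{n-1}\,\mathbf{V}_k \mathbf{S}_k \mathbf{U}_k^T \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T = \mathbf{V}_k \frac{\mathbf{S}_k^2}{n-1} \mathbf{V}_k^T,$$ and comparing this with $\mathbf{C} \approx \mathbf{V}_k \mathbf{L}_k \mathbf{V}_k^T$ gives $\mathbf{L}_k = \frac{1}{n-1}\mathbf{S}_k^2$, i.e. $(*)$.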

But for sklearn.decomposition.TruncatedSVD only the following holds: $$\mathbf{X} \approx \tilde{\mathbf{U}}_k \tilde{\mathbf{S}}_k \tilde{\mathbf{V}}_k^T \qquad (\text{truncated SVD of } \mathbf{X}).$$ In this case we can't get a simple equation like $(*)$ linking $\mathbf{L}_k$ and $\tilde{\mathbf{S}}_k$, because substituting the truncated SVD of $\mathbf{X}$ into the expression $\mathbf{C} = \frac{1}{n-1} \mathbf{X}_c^T\mathbf{X}_c = \frac{1}{n-1}\mathbf{X}^T\mathbf{X} - \frac{n}{n-1}\bar{x}\bar{x}^T$ will not be very useful. So explained_variance_ in sklearn.decomposition.TruncatedSVD is instead calculated as np.var(X_transformed, axis=0), where X_transformed = $\mathbf{X} \tilde{\mathbf{V}}_k$ is the matrix of scores (new features).
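
A minimal sketch of the two computations (assuming a scikit-learn recent enough to expose singular_values_ on PCA; the random, deliberately uncentered data is just for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.randn(100, 10) + 5.0              # deliberately uncentered data
n_samples = X.shape[0]

pca = PCA(n_components=3).fit(X)
tsvd = TruncatedSVD(n_components=3, random_state=0)
X_transformed = tsvd.fit_transform(X)     # scores (new features)

# PCA: eigenvalues of the covariance matrix, lambda_j = s_j**2 / (n - 1)
print(np.allclose(pca.explained_variance_,
                  pca.singular_values_ ** 2 / (n_samples - 1)))

# TruncatedSVD: empirical variance of each score column
print(np.allclose(tsvd.explained_variance_,
                  np.var(X_transformed, axis=0)))
```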

Rodvi