6

I've been reading some documentation about PCA and trying to use scikit-learn to implement it, but I struggle to understand what the attributes returned by sklearn.decomposition.PCA are. From what I read here and from the name of this attribute, my first guess would be that the attribute .components_ is the matrix of principal components, meaning that if we have a data set X which can be decomposed using SVD as

X = USV^T

then I would expect the attribute .components_ to be equal to

XV = US.

To check this, I took the first example from the Wikipedia page on singular value decomposition (here) and tried it to see whether I would obtain what is expected, but I get something different. To be sure I didn't make a mistake, I used scipy.linalg.svd to perform the singular value decomposition of my matrix X, and I obtained the result described on Wikipedia:

import numpy as np
from scipy.linalg import svd

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]])
U, s, Vh = svd(X)
print('U = %s' % U)
print('Vh = %s' % Vh)
print('s = %s' % s)

output:

U = [[ 0.  1.  0.  0.]
[ 1.  0.  0.  0.]
[ 0.  0.  0. -1.]
[ 0.  0.  1.  0.]]
Vh = [[-0.          0.          1.          0.          0.        ]
[ 0.4472136   0.          0.          0.          0.89442719]
[-0.          1.          0.          0.          0.        ]
[ 0.          0.          0.          1.          0.        ]
[-0.89442719  0.          0.          0.          0.4472136 ]]
s = [ 3.          2.23606798  2.          0.        ]

But with scikit-learn I obtain this:

from sklearn.decomposition import PCA

pca = PCA(svd_solver='auto', whiten=True)
pca.fit(X)
print(pca.components_)
print(pca.singular_values_)

and the output is

[[ -1.47295237e-01  -2.15005028e-01   9.19398392e-01  -0.00000000e+00
-2.94590475e-01]
[  3.31294578e-01  -6.62589156e-01   1.10431526e-01   0.00000000e+00
6.62589156e-01]
[ -2.61816759e-01  -7.17459719e-01  -3.77506920e-01   0.00000000e+00
-5.23633519e-01]
[  8.94427191e-01  -2.92048264e-16  -7.93318415e-17   0.00000000e+00
-4.47213595e-01]]
[  2.77516885e+00   2.12132034e+00   1.13949018e+00   1.69395499e-16]

which is not equal to SV^T (I spare you the matrix multiplication; in any case, you can see that the singular values are different from the ones obtained above). I tried setting the parameter whiten to False and the parameter svd_solver to 'full', but it doesn't change the result.
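For concreteness, a minimal sketch of the skipped product, using the X, s, Vh and pca objects defined above (S here is just the 4x5 rectangular matrix holding the singular values):

S = np.zeros(X.shape)
np.fill_diagonal(S, s)
print(S @ Vh)           # the S V^T computed from the uncentered SVD
print(pca.components_)  # a visibly different matrix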

Do you see a mistake somewhere, or do you have an explanation?


GWa
  • See https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca/134283#134283 – Tim Nov 04 '17 at 23:02
  • PCA = SVD after centering. You did not center. Hence the difference. – amoeba Nov 04 '17 at 23:33 (see the short demonstration after these comments)
  • Thanks a lot amoeba. Indeed, I didn't pay attention to this centering step. Have a nice day! – GWa Nov 19 '17 at 09:04
  • This article (https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html) succinctly explains what "the attributes returned by sklearn.decomposition.PCA" are. – ANIRBAN DAS Nov 16 '20 at 09:01
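A minimal sketch of the centering point made in the comment above, using the X from the question (PCA subtracts the column means before taking the SVD):

import numpy as np
from scipy.linalg import svd
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]], dtype=float)

# Center the columns by hand, then take the SVD, as PCA does internally.
Xc = X - X.mean(axis=0)
U, s, Vh = svd(Xc, full_matrices=False)

pca = PCA().fit(X)
print(np.allclose(s, pca.singular_values_))                      # True
# The components match the right singular vectors up to a sign flip per row
# (the 4th singular value is ~0, so only the first three rows are meaningful).
print(np.allclose(np.abs(Vh[:3]), np.abs(pca.components_[:3])))  # True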

3 Answers

5

Annoyingly there is no SKLearn documentation for this attribute, beyond the general description of the PCA method.

Here is a useful application of pca.components_ in a classic facial-recognition project (using data bundled with SKL, so you don't have to download anything extra). Working through this concise notebook is the best way to get a feel for the definition and application of pca.components_.

From that project, and this answer over on StackOverflow, we can learn that pca.components_ is the set of all eigenvectors (aka loadings) for your projection space (one eigenvector for each principal component). Once you have the eigenvectors using pca.components_, here's how to get eigenvalues.
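As a small, hedged illustration (assuming a PCA instance fitted as in the question, so that pca and X already exist), the eigenvalues that pair with those eigenvectors are available directly on the fitted object:

import numpy as np

eigenvalues = pca.explained_variance_  # eigenvalues of the covariance matrix
# They equal the squared singular values divided by (n_samples - 1).
print(np.allclose(eigenvalues, pca.singular_values_ ** 2 / (X.shape[0] - 1)))  # True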

For further info on the definitions & applications of eigenvectors vs loadings (including the equation that links all three concepts), see here.

For a 2nd project/notebook applying pca.components_ to (the same) facial recognition data, see here. It features a more traditional scree plot than the first project cited above.

olisteadman
  • This introduces a lot of confusion. The link you gave DOES contain documentation for this attribute. And my old comment above explained what OP did wrong. – amoeba Apr 26 '18 at 11:36
  • Which eigenvector? The projection space of the rows or of the columns (U or V)? – Pandian Le Jun 25 '21 at 00:14
1

pca.components_ is nothing other than the loading scores. With PCA via SVD (singular value decomposition), the principal components are scaled to length 1. Imagine the loading scores as the recipe for a cocktail: PC1 is made of 0.97 parts Gen1 and 0.242 parts Gen2 (the loading scores for PC1), and PC2 is made of -0.242 parts Gen1 and 0.97 parts Gen2 (the loading scores for PC2).

This indeed gives us the so-called singular vector, or eigenvector, of each component (the loading scores are the coefficients of each variable for the first component versus the coefficients for the second component).

This also indicates which variables have the largest effect on each component. Loadings can range from -1 to 1. Loadings close to -1 or 1 indicate that the variable strongly influences the component. Loadings close to 0 indicate that the variable has a weak influence on the component. Evaluating the loadings can also help you characterize each component in terms of the variables.
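A small sketch of that "recipe" reading of the loadings; the data here are made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))   # hypothetical data set with two variables

pca = PCA().fit(X)

# Each row of components_ is a unit-length vector of loadings,
# one coefficient ("part" of the recipe) per original variable.
print(np.linalg.norm(pca.components_, axis=1))  # [1. 1.]
print(pca.components_[0])                       # loadings for PC1
print(pca.components_[1])                       # loadings for PC2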

I would advise watching the following video: https://www.youtube.com/watch?v=FgakZw6K1QQ

Antonio
1

components_ are mathematically the eigenvectors of the covariance matrix of the centered input matrix. This can be verified by using plain numpy.
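For instance, a minimal sketch of that check with the X from the question above:

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 0, 0, 0, 2],
              [0, 0, 3, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 2, 0, 0, 0]], dtype=float)

pca = PCA(n_components=3).fit(X)

cov = np.cov(X, rowvar=False)           # np.cov centers the columns internally
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
top3 = eigvecs[:, ::-1][:, :3]          # eigenvectors of the three largest eigenvalues

# Each principal component equals one of these eigenvectors up to a sign flip.
print(np.allclose(np.abs(pca.components_), np.abs(top3.T)))  # True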

Long Pollehn