
Am I misunderstanding something? This is my code:

Using sklearn:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

pca = decomposition.PCA(n_components=3)

x = np.array([
    [0.387, 4878, 5.42],
    [0.723, 12104, 5.25],
    [1, 12756, 5.52],
    [1.524, 6787, 3.94],
])
pca.fit_transform(x)

Output:

array([[ -4.25324997e+03,  -8.41288672e-01,  -8.37858943e-03],
       [  2.97275001e+03,  -1.25977271e-01,   1.82476780e-01],
       [  3.62475003e+03,  -1.56843494e-01,  -1.65224286e-01],
       [ -2.34425007e+03,   1.12410944e+00,  -8.87390454e-03]])

Using numpy methods:

x_std = StandardScaler().fit_transform(x)
cov = np.cov(x_std.T)
ev , eig = np.linalg.eig(cov)
a = eig.dot(x_std.T)

Output:

array([[ 0.06406894,  0.94063993, -1.62373172],
       [-0.35357757,  0.7509653 ,  0.63365168],
       [ 0.29312477,  0.6710958 ,  1.11766206],
       [-0.00361615, -2.36270102, -0.12758202]])

I have kept all 3 components, but it doesn't seem to let me recover my original data.

Why is this so?

If I want to obtain my original matrix back, what should I do?

aceminer
  • Your numpy code is wrong IMHO (also, it uses `X` which is not defined). Recheck your *math*. – Has QUIT--Anony-Mousse Sep 20 '16 at 06:12
  • I am using IPython notebook so I could only copy by cells. Is my maths wrong? Which part? @Anony-Mousse – aceminer Sep 20 '16 at 08:41
  • @Anony-Mousse Yes, I realised my error, but it still doesn't match up – aceminer Sep 20 '16 at 08:50
  • @aceminer I am curious why you calculate the covariance matrix of x_std.T, not x_std? – Evgeni Nabokov Mar 03 '19 at 18:35
  • @EvgeniNabokov It's been too long; sorry, I cannot remember anymore – aceminer Mar 06 '19 at 06:58
  • @aceminer I guess you transpose the matrix because in your data "each column represents a variable, while the rows contain observations". For that case it is more informative to use the parameter rowvar=False instead of transposing. See the answer from Nikolas Rieble. – Evgeni Nabokov Mar 06 '19 at 20:37
  • Mathematically speaking, PCA itself does not include StandardScaler().fit_transform(x); if you standardize, the PCA is performed on the correlation matrix rather than the covariance matrix. That is why scikit-learn does not include that step. – Tony Apr 14 '20 at 13:13
  • `x_std = StandardScaler().fit_transform(x)` is not a numpy method, as per the subtitle – jtlz2 Jun 09 '21 at 10:32

2 Answers


The difference is because decomposition.PCA does not standardize your variables before doing PCA, whereas in your manual computation you call StandardScaler to do the standardization. Hence, you are observing this difference: PCA on correlation or covariance?

If you replace

pca.fit_transform(x)

with

x_std = StandardScaler().fit_transform(x)
pca.fit_transform(x_std)

you will get the same result as with manual computation...

...but only up to the order of the PCs. That is because when you run

ev , eig = np.linalg.eig(cov)

you get eigenvalues that are not necessarily in decreasing order. I get

array([ 0.07168571,  2.49382602,  1.43448827])

So you will want to order them manually. Sklearn does that for you.
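For completeness, a minimal sketch of that manual ordering (reusing `ev`, `eig`, and `x_std` from the question; note that the eigenvectors are the *columns* of `eig`, so the projection is `x_std.dot(eig)` rather than `eig.dot(x_std.T)`):

# sort eigenvalues, and the matching eigenvector columns, in descending order
order = np.argsort(ev)[::-1]
ev, eig = ev[order], eig[:, order]

# project the standardized data onto the ordered eigenvectors;
# each column of scores is now one PC, in sklearn's order
scores = x_std.dot(eig)

Even after ordering, individual columns may still differ from sklearn's output by a factor of -1, because eigenvector signs are arbitrary.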


Regarding reconstructing original variables, please see How to reverse PCA and reconstruct original variables from several principal components?
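For example, a minimal sketch using sklearn's own inverse transforms (reusing the imports and `x` from the question); because all 3 components are kept, the reconstruction is exact up to floating-point error:

scaler = StandardScaler()
x_std = scaler.fit_transform(x)

pca = decomposition.PCA(n_components=3)
scores = pca.fit_transform(x_std)

# undo the PCA rotation, then undo the standardization
x_back = scaler.inverse_transform(pca.inverse_transform(scores))
print(np.allclose(x_back, x))  # True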

amoeba
  • Just would like to check: is it really necessary to standardize the matrix by its standard deviation? I have seen examples where they do not do it – aceminer Sep 20 '16 at 13:50
  • It is not *necessary*, it is just one way of doing it. See the link that I put in the first paragraph: http://stats.stackexchange.com/questions/53 -- it really is all about this question. If you standardize, you do PCA on correlations. If you do not, you do PCA on covariances. – amoeba Sep 20 '16 at 14:07

Here is a nice implementation with a discussion and explanation of PCA in Python. This implementation leads to the same result as the scikit-learn PCA, which is another indicator that your PCA is wrong.

import numpy as np
from scipy import linalg as LA

x = np.array([
    [0.387, 4878, 5.42],
    [0.723, 12104, 5.25],
    [1, 12756, 5.52],
    [1.524, 6787, 3.94],
])

# center the data
x -= np.mean(x, axis=0)

cov = np.cov(x, rowvar=False)

evals, evecs = LA.eigh(cov)

You need to sort the eigenvalues (and the eigenvectors accordingly) in descending order:

idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]

a = np.dot(x, evecs) 
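To verify that this matches scikit-learn, a quick sketch (comparing absolute values, since each component is only determined up to sign):

from sklearn.decomposition import PCA

# x is already centered in place above; PCA centers again, a no-op here
sk_scores = PCA(n_components=3).fit_transform(x)
print(np.allclose(np.abs(a), np.abs(sk_scores)))  # True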

Generally, I recommend that you check your code by implementing a simple example (as simple as possible) and calculating the correct results (and intermediate results) by hand. This helps you identify the problem.

Nikolas Rieble
  • Love this answer. It solved my problem! – Jinhua Wang Jul 12 '18 at 15:28
  • Note that `linalg.eigh` assumes the input is symmetric (that is, it is optimized to read only those values from one "triangle" half of the matrix). The OP uses `linalg.eig` instead, which makes no such assumption. – zr0gravity7 Feb 28 '22 at 22:12