
Am I misunderstanding something? This is my code:

Using sklearn:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

pca = decomposition.PCA(n_components=3)

x = np.array([
    [0.387, 4878, 5.42],
    [0.723, 12104, 5.25],
    [1, 12756, 5.52],
    [1.524, 6787, 3.94],
])
pca.fit_transform(x)

Output:

array([[ -4.25324997e+03,  -8.41288672e-01,  -8.37858943e-03],
       [  2.97275001e+03,  -1.25977271e-01,   1.82476780e-01],
       [  3.62475003e+03,  -1.56843494e-01,  -1.65224286e-01],
       [ -2.34425007e+03,   1.12410944e+00,  -8.87390454e-03]])

Using numpy methods:

x_std = StandardScaler().fit_transform(x)
cov = np.cov(x_std.T)
ev , eig = np.linalg.eig(cov)
a = eig.dot(x_std.T)

Output:

array([[ 0.06406894,  0.94063993, -1.62373172],
       [-0.35357757,  0.7509653 ,  0.63365168],
       [ 0.29312477,  0.6710958 ,  1.11766206],
       [-0.00361615, -2.36270102, -0.12758202]])

I have kept all 3 components, but it doesn't seem to let me recover my original data.

Why is this so?

If I want to obtain my original matrix back, what should I do?

aceminer
  • Your numpy code is wrong IMHO (also, it uses `X` which is not defined). Recheck your *math*. – Has QUIT--Anony-Mousse Sep 20 '16 at 06:12
  • I am using IPython notebook so I could only copy by cells. Is my maths wrong? Which part? @Anony-Mousse – aceminer Sep 20 '16 at 08:41
  • @Anony-Mousse Yes, I realised my error, but it still doesn't match up – aceminer Sep 20 '16 at 08:50
  • @aceminer I am curious why you calculate the covariance matrix of x_std.T, not x_std? – Evgeni Nabokov Mar 03 '19 at 18:35
  • @EvgeniNabokov It's been too long; sorry, I cannot remember anymore – aceminer Mar 06 '19 at 06:58
  • @aceminer I guess you transpose the matrix because in your data "each column represents a variable, while the rows contain observations". For that case it is more informative to use the parameter rowvar=False instead of transposing. See the answer from Nikolas Rieble. – Evgeni Nabokov Mar 06 '19 at 20:37
  • Mathematically speaking, PCA itself does not include StandardScaler().fit_transform(x); if you standardize, the PCA is performed on the correlation matrix rather than the covariance matrix. That is why scikit-learn does not include that step. – Tony Apr 14 '20 at 13:13
  • `x_std = StandardScaler().fit_transform(x)` is not a numpy method, as per the subtitle – jtlz2 Jun 09 '21 at 10:32

2 Answers


The difference is because decomposition.PCA does not standardize your variables before doing PCA, whereas in your manual computation you call StandardScaler to do the standardization. Hence, you are observing this difference: PCA on correlation or covariance?

If you replace

pca.fit_transform(x)

with

x_std = StandardScaler().fit_transform(x)
pca.fit_transform(x_std)

you will get the same result as with manual computation...

...but only up to the order of the PCs. That is because when you run

ev , eig = np.linalg.eig(cov)

you get eigenvalues that are not necessarily in decreasing order. I get

array([ 0.07168571,  2.49382602,  1.43448827])

So you will want to order them manually. Sklearn does that for you.
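For completeness, a minimal sketch of that manual ordering (reusing `ev`, `eig`, and `x_std` from the question; note that the eigenvectors are the *columns* of `eig`, so the projection is `x_std.dot(eig)` rather than `eig.dot(x_std.T)`):

# sort eigenvalues, and the matching eigenvector columns, in descending order
order = np.argsort(ev)[::-1]
ev, eig = ev[order], eig[:, order]

# project the standardized data onto the ordered eigenvectors;
# each column of scores is now one PC, in sklearn's order
scores = x_std.dot(eig)

Even after ordering, individual columns may still differ from sklearn's output by a factor of -1, because eigenvector signs are arbitrary.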


Regarding reconstructing original variables, please see How to reverse PCA and reconstruct original variables from several principal components?
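For example, a minimal sketch using sklearn's own inverse transforms (reusing the imports and `x` from the question); because all 3 components are kept, the reconstruction is exact up to floating-point error:

scaler = StandardScaler()
x_std = scaler.fit_transform(x)

pca = decomposition.PCA(n_components=3)
scores = pca.fit_transform(x_std)

# undo the PCA rotation, then undo the standardization
x_back = scaler.inverse_transform(pca.inverse_transform(scores))
print(np.allclose(x_back, x))  # True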

amoeba
  • Just would like to check: is it really necessary to standardize the matrix by its standard deviation? I have seen examples where they do not do it – aceminer Sep 20 '16 at 13:50
  • It is not *necessary*, it is just one way of doing it. See the link that I put in the first paragraph: http://stats.stackexchange.com/questions/53 -- it really is all about this question. If you standardize, you do PCA on correlations. If you do not, you do PCA on covariances. – amoeba Sep 20 '16 at 14:07

Here is a nice implementation with a discussion and explanation of PCA in Python. This implementation leads to the same result as the scikit-learn PCA, which is another indicator that your PCA is wrong.

import numpy as np
from scipy import linalg as LA

x = np.array([
    [0.387, 4878, 5.42],
    [0.723, 12104, 5.25],
    [1, 12756, 5.52],
    [1.524, 6787, 3.94],
])

# center the data
x -= np.mean(x, axis=0)

cov = np.cov(x, rowvar=False)

evals, evecs = LA.eigh(cov)

You need to sort the eigenvalues (and the eigenvectors accordingly) in descending order:

idx = np.argsort(evals)[::-1]
evecs = evecs[:,idx]
evals = evals[idx]

a = np.dot(x, evecs) 
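To verify that this matches scikit-learn, a quick sketch (comparing absolute values, since each component is only determined up to sign):

from sklearn.decomposition import PCA

# x is already centered in place above; PCA centers again, a no-op here
sk_scores = PCA(n_components=3).fit_transform(x)
print(np.allclose(np.abs(a), np.abs(sk_scores)))  # True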

Generally, I recommend that you check your code by implementing a simple example (as simple as possible) and calculating the correct results (and intermediate results) by hand. This helps you identify the problem.

Nikolas Rieble
  • Love this answer. It solved my problem! – Jinhua Wang Jul 12 '18 at 15:28
  • Note that `linalg.eigh` assumes the input is symmetric (that is, it is optimized to read only those values from one "triangle" half of the matrix). The OP uses `linalg.eig` instead, which makes no such assumption. – zr0gravity7 Feb 28 '22 at 22:12