
I have been trying to automate, in Python, a PCA that I previously performed in SPSS.

This is my code:

import numpy as np

# Each row of input.csv holds six semicolon-separated values; keep the first
# six columns and transpose so that each row of `data` is one variable.
data = np.genfromtxt('input.csv', delimiter=';', usecols=range(0, 6))
data = data.T
data /= np.linalg.norm(data)  # scaling by one scalar; corrcoef is unaffected

# Correlation matrix of the six variables.
corrmat = np.corrcoef(data)

# Eigendecomposition, sorted by descending eigenvalue.
eigenvalues, eigenvectors = np.linalg.eig(corrmat)
evals_order = np.argsort(-eigenvalues)
eigenvalues = eigenvalues[evals_order]
eigenvectors = eigenvectors[:, evals_order]
data = data[evals_order]

And this is the example data:

array([[  26.2,   18.7,   21.8,  758.5,   14.7,   63. ],
       [  27.8,   19.5,   22.8,  757.3,   16.6,   65. ],
       [  27.1,   19.7,   22.9,  756.1,   16.9,   67. ],
       [  26.3,   19.6,   22.6,  757.7,   15.1,   62. ],
       [  30.3,   22.7,   26. ,  757. ,   20.3,   68. ],
       [  32. ,   24.1,   27.4,  757.4,   22.9,   71. ],
       [  32.1,   24.4,   27.8,  758. ,   26. ,   78. ],
       [  32.4,   24.8,   28.2,  758.8,   22.7,   68. ],
       [  32.4,   24.7,   27.6,  753.3,   22.8,   70. ],
       [  28.2,   23.9,   25.4,  756.1,   19.7,   75. ],
       [  28.1,   22. ,   24.5,  756.8,   19.6,   74. ],
       [  26.8,   19.8,   22.7,  758.6,   17.3,   70. ],
       [  25.5,   18.7,   21.7,  760.6,   15.6,   68. ],
       [  25. ,   18.4,   21.2,  759.5,   15.4,   70. ],
       [  26.9,   19.2,   22.7,  759.4,   16.4,   66. ],
       [  29.5,   21.6,   24.9,  756.6,   17.5,   62. ],
       [  29.1,   21.7,   24.8,  756.5,   19. ,   70. ],
       [  30. ,   23.8,   26.4,  756.6,   22.8,   77. ],
       [  31.4,   24.2,   27.1,  758.7,   23.4,   73. ],
       [  31.6,   24. ,   27.1,  756.7,   22.9,   71. ],
       [  31.1,   24.1,   25.4,  756. ,   22.1,   69. ],
       [  29.1,   23. ,   25.8,  756.7,   20.9,   74. ],
       [  28.7,   22.3,   24.9,  756.9,   19.9,   71. ],
       [  26.5,   19.7,   22.6,  760.3,   15.2,   65. ],
       [  27.3,   19.7,   23. ,  760.2,   16.2,   63. ],
       [  27. ,   19.4,   22.7,  761.3,   15.7,   63. ],
       [  27.9,   20. ,   23.4,  758.7,   15.8,   61. ],
       [  28.6,   21.6,   24.7,  757.8,   18.6,   67. ],
       [  30.5,   23.3,   26.4,  757.8,   20.1,   67. ],
       [  31.1,   23.5,   26.9,  758.2,   20.8,   67. ],
       [  30.9,   23.9,   26.9,  758.7,   22.3,   70. ],
       [  31.4,   24.4,   27.5,  756.7,   23. ,   72. ],
       [  31.9,   24.1,   27.3,  755.1,   22.9,   69. ],
       [  29.6,   22.8,   25.7,  757. ,   20.1,   69. ],
       [  28.7,   22.3,   24.9,  757.2,   20. ,   74. ],
       [  25.6,   19. ,   21.8,  759.1,   15.7,   68. ]])

With those data, SPSS outputs:

Factor coordinates of the variables, based on correlations

     Factor 1    Factor 2    Factor 3
X1  -0.940527    0.291237   -0.140736
X2  -0.981433    0.072199   -0.078509
X3  -0.967474    0.167024   -0.156249
X4   0.655641   -0.095169   -0.748961
X5  -0.979639   -0.073088   -0.141371
X6  -0.671227   -0.740680    0.011958

I have read:

but none of them seem to be what I'm looking for.

1 Answer


With your corrmat (and to get the same output as SPSS using Python's numpy library), I would do:

>>> eigenvalues   = np.linalg.eigvals(corrmat)
>>> _eigenvectors = np.linalg.eig(corrmat)[1]
>>> eigenvectors  = - _eigenvectors * np.sign(np.sum(_eigenvectors, 0))
                    ^

Note the minus sign above, which, as you surely know, can be flipped without changing the variance contained in the components. I flip the eigenvectors here simply to match the ones given by SPSS.
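As a quick sanity check of that invariance, here is a minimal sketch (using the correlation matrix of synthetic data rather than the question's `corrmat`, so it runs standalone) showing that flipping the sign of an eigenvector column leaves the reconstructed correlation matrix unchanged:

import numpy as np

# Any symmetric PSD matrix will do; here, the correlation matrix of random data.
rng = np.random.default_rng(0)
corrmat = np.corrcoef(rng.standard_normal((36, 6)), rowvar=False)

eigenvalues, eigenvectors = np.linalg.eig(corrmat)

# Flip the sign of one arbitrary eigenvector column.
flipped = eigenvectors.copy()
flipped[:, 1] *= -1

# Both sign choices reconstruct the same matrix: V diag(lambda) V.T == corrmat.
assert np.allclose(eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T, corrmat)
assert np.allclose(flipped @ np.diag(eigenvalues) @ flipped.T, corrmat)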

And finally

>>> eigenvectors*pow(eigenvalues, .5)
[[-0.9405272747183386  0.2912371623961133 -0.1407363781821823  0.0912757427551984 -0.0494647032587364 -0.0021481439731338]
 [-0.9814331113889905  0.0721992935972806 -0.0785090923649322 -0.1459649895629314 -0.0045603280920887 -0.0639222771731283]
 [-0.9674737210319674  0.1670238493452638 -0.156248635777037  -0.0559392103161693  0.0185365986221359  0.0906156495368767]
 [ 0.6556408081143963 -0.0951692221784938 -0.7489613102250138 -0.0103839230577475 -0.0029680173560357 -0.0042744222974036]
 [-0.979638613927672  -0.0730875114449731 -0.1413705590288849  0.1070180886447224  0.0453366719942605 -0.0383729291014583]
 [-0.6712266404563406 -0.7406796816331325  0.0119583064943183 -0.0001786453151769 -0.0198064550991169  0.0176940014010874]]
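As a check of what these numbers mean: with all six components retained, the loadings matrix L = V·diag(√λ) satisfies L·Lᵀ = V·diag(λ)·Vᵀ = corrmat (the sign flips cancel out), so the full loadings table exactly reproduces the correlation matrix. A quick sketch, reusing eigenvalues, eigenvectors and corrmat from the session above:

>>> loadings = eigenvectors * pow(eigenvalues, .5)
>>> np.allclose(loadings @ loadings.T, corrmat)
True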

This is one way to calculate factor coordinates in PCA using Python.
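For reference, here is one way to consolidate the whole computation into a single self-contained function (a sketch only; input.csv is the file from the question). It uses np.linalg.eigh, which is designed for symmetric matrices and returns real eigenvalues in ascending order, so they are re-sorted to descending:

import numpy as np

def spss_style_loadings(data):
    """Factor coordinates (loadings) of the variables, based on correlations."""
    # Correlation matrix of the variables (one column per variable).
    corrmat = np.corrcoef(data, rowvar=False)

    # eigh is meant for symmetric matrices: real eigenvalues, ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(corrmat)

    # Re-sort to descending eigenvalue order.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Sign convention as above: flip each component so that it matches SPSS.
    eigenvectors = -eigenvectors * np.sign(np.sum(eigenvectors, axis=0))

    # Loadings: eigenvectors scaled by the square roots of their eigenvalues.
    return eigenvectors * np.sqrt(eigenvalues)

data = np.genfromtxt('input.csv', delimiter=';', usecols=range(0, 6))
print(spss_style_loadings(data)[:, :3])  # first three factors, as in SPSS

On the question's data this should reproduce the first three columns of the SPSS table above, up to floating-point rounding.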

The paper that helped me understand this is Yoel Haitovsky (1966).

  • I'm going to read the paper to understand what you have done. Can you please take a look at the 2nd column? The sign is the opposite of what I expected. – engel May 10 '17 at 15:41
  • @engel Did you find the same results as the ones reported by SPSS? – keepAlive May 10 '17 at 15:48
  • Yes, the results are exactly equal to yours, except for the 2nd column. I'm going to try different datasets to see what happens. – engel May 10 '17 at 16:00
  • @engel Ok ***solved***. The problem really was that **only** one component was flipped. Note the part `np.sign(np.sum(_eigenvectors, 0))`: by doing so I follow some implementations that change the sign of a factor so that its positive values dominate in sum. – keepAlive May 10 '17 at 19:00
  • When I use the eig and eigvals functions, the result is sometimes complex; eigh and eigvalsh can solve this problem (see the sketch below). – xingpei Pang Jul 03 '19 at 08:04
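To illustrate that last comment: np.linalg.eig treats its input as a general matrix, so floating-point asymmetry can produce eigenvalues with tiny imaginary parts, whereas np.linalg.eigh assumes a symmetric (Hermitian) matrix and always returns real eigenvalues. A minimal sketch with synthetic data:

import numpy as np

rng = np.random.default_rng(0)
corrmat = np.corrcoef(rng.standard_normal((36, 6)), rowvar=False)

# eig: general-matrix routine; dtype can come back complex in edge cases.
print(np.linalg.eig(corrmat)[0].dtype)   # float64 here, complex128 in edge cases

# eigh: symmetric routine; eigenvalues are always real (and sorted ascending).
print(np.linalg.eigh(corrmat)[0].dtype)  # float64, always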