
I have been trying to automate, in Python, a PCA that I previously performed in SPSS.

This is my code:

import numpy as np

# Each row of input.csv holds six semicolon-separated values; keep the first
# six columns and transpose so that each row of `data` is one variable.
data = np.genfromtxt('input.csv', delimiter=';', usecols=range(0, 6))
data = data.T
data /= np.linalg.norm(data)  # scaling by one scalar; corrcoef is unaffected

# Correlation matrix of the six variables.
corrmat = np.corrcoef(data)

# Eigendecomposition, sorted by descending eigenvalue.
eigenvalues, eigenvectors = np.linalg.eig(corrmat)
evals_order = np.argsort(-eigenvalues)
eigenvalues = eigenvalues[evals_order]
eigenvectors = eigenvectors[:, evals_order]
data = data[evals_order]

And this is the example data:

array([[  26.2,   18.7,   21.8,  758.5,   14.7,   63. ],
       [  27.8,   19.5,   22.8,  757.3,   16.6,   65. ],
       [  27.1,   19.7,   22.9,  756.1,   16.9,   67. ],
       [  26.3,   19.6,   22.6,  757.7,   15.1,   62. ],
       [  30.3,   22.7,   26. ,  757. ,   20.3,   68. ],
       [  32. ,   24.1,   27.4,  757.4,   22.9,   71. ],
       [  32.1,   24.4,   27.8,  758. ,   26. ,   78. ],
       [  32.4,   24.8,   28.2,  758.8,   22.7,   68. ],
       [  32.4,   24.7,   27.6,  753.3,   22.8,   70. ],
       [  28.2,   23.9,   25.4,  756.1,   19.7,   75. ],
       [  28.1,   22. ,   24.5,  756.8,   19.6,   74. ],
       [  26.8,   19.8,   22.7,  758.6,   17.3,   70. ],
       [  25.5,   18.7,   21.7,  760.6,   15.6,   68. ],
       [  25. ,   18.4,   21.2,  759.5,   15.4,   70. ],
       [  26.9,   19.2,   22.7,  759.4,   16.4,   66. ],
       [  29.5,   21.6,   24.9,  756.6,   17.5,   62. ],
       [  29.1,   21.7,   24.8,  756.5,   19. ,   70. ],
       [  30. ,   23.8,   26.4,  756.6,   22.8,   77. ],
       [  31.4,   24.2,   27.1,  758.7,   23.4,   73. ],
       [  31.6,   24. ,   27.1,  756.7,   22.9,   71. ],
       [  31.1,   24.1,   25.4,  756. ,   22.1,   69. ],
       [  29.1,   23. ,   25.8,  756.7,   20.9,   74. ],
       [  28.7,   22.3,   24.9,  756.9,   19.9,   71. ],
       [  26.5,   19.7,   22.6,  760.3,   15.2,   65. ],
       [  27.3,   19.7,   23. ,  760.2,   16.2,   63. ],
       [  27. ,   19.4,   22.7,  761.3,   15.7,   63. ],
       [  27.9,   20. ,   23.4,  758.7,   15.8,   61. ],
       [  28.6,   21.6,   24.7,  757.8,   18.6,   67. ],
       [  30.5,   23.3,   26.4,  757.8,   20.1,   67. ],
       [  31.1,   23.5,   26.9,  758.2,   20.8,   67. ],
       [  30.9,   23.9,   26.9,  758.7,   22.3,   70. ],
       [  31.4,   24.4,   27.5,  756.7,   23. ,   72. ],
       [  31.9,   24.1,   27.3,  755.1,   22.9,   69. ],
       [  29.6,   22.8,   25.7,  757. ,   20.1,   69. ],
       [  28.7,   22.3,   24.9,  757.2,   20. ,   74. ],
       [  25.6,   19. ,   21.8,  759.1,   15.7,   68. ]])

With those data, SPSS outputs:

Factor coordinates of the variables, based on correlations

     Factor 1    Factor 2    Factor 3
X1  -0.940527    0.291237   -0.140736
X2  -0.981433    0.072199   -0.078509
X3  -0.967474    0.167024   -0.156249
X4   0.655641   -0.095169   -0.748961
X5  -0.979639   -0.073088   -0.141371
X6  -0.671227   -0.740680    0.011958

I have read:

but none of them seem to be what I'm looking for.

1 Answer


With your corrmat (and to get the same output as SPSS using Python's numpy library), I would do:

>>> eigenvalues   = np.linalg.eigvals(corrmat)
>>> _eigenvectors = np.linalg.eig(corrmat)[1]
>>> eigenvectors  = - _eigenvectors * np.sign(np.sum(_eigenvectors, 0))
                    ^

Note the minus sign above, which, as you surely know, can be flipped without changing the variance contained in the components. I flip the eigenvectors here simply to match the ones given by SPSS.
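As a quick sanity check of that invariance, here is a minimal sketch (using the correlation matrix of synthetic data rather than the question's `corrmat`, so it runs standalone) showing that flipping the sign of an eigenvector column leaves the reconstructed correlation matrix unchanged:

import numpy as np

# Any symmetric PSD matrix will do; here, the correlation matrix of random data.
rng = np.random.default_rng(0)
corrmat = np.corrcoef(rng.standard_normal((36, 6)), rowvar=False)

eigenvalues, eigenvectors = np.linalg.eig(corrmat)

# Flip the sign of one arbitrary eigenvector column.
flipped = eigenvectors.copy()
flipped[:, 1] *= -1

# Both sign choices reconstruct the same matrix: V diag(lambda) V.T == corrmat.
assert np.allclose(eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T, corrmat)
assert np.allclose(flipped @ np.diag(eigenvalues) @ flipped.T, corrmat)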

And finally

>>> eigenvectors*pow(eigenvalues, .5)
[[-0.9405272747183386  0.2912371623961133 -0.1407363781821823  0.0912757427551984 -0.0494647032587364 -0.0021481439731338]
 [-0.9814331113889905  0.0721992935972806 -0.0785090923649322 -0.1459649895629314 -0.0045603280920887 -0.0639222771731283]
 [-0.9674737210319674  0.1670238493452638 -0.156248635777037  -0.0559392103161693  0.0185365986221359  0.0906156495368767]
 [ 0.6556408081143963 -0.0951692221784938 -0.7489613102250138 -0.0103839230577475 -0.0029680173560357 -0.0042744222974036]
 [-0.979638613927672  -0.0730875114449731 -0.1413705590288849  0.1070180886447224  0.0453366719942605 -0.0383729291014583]
 [-0.6712266404563406 -0.7406796816331325  0.0119583064943183 -0.0001786453151769 -0.0198064550991169  0.0176940014010874]]
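As a check of what these numbers mean: with all six components retained, the loadings matrix L = V·diag(√λ) satisfies L·Lᵀ = V·diag(λ)·Vᵀ = corrmat (the sign flips cancel out), so the full loadings table exactly reproduces the correlation matrix. A quick sketch, reusing eigenvalues, eigenvectors and corrmat from the session above:

>>> loadings = eigenvectors * pow(eigenvalues, .5)
>>> np.allclose(loadings @ loadings.T, corrmat)
True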

This is one way to calculate factor coordinates in PCA using Python.
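For reference, here is one way to consolidate the whole computation into a single self-contained function (a sketch only; input.csv is the file from the question). It uses np.linalg.eigh, which is designed for symmetric matrices and returns real eigenvalues in ascending order, so they are re-sorted to descending:

import numpy as np

def spss_style_loadings(data):
    """Factor coordinates (loadings) of the variables, based on correlations."""
    # Correlation matrix of the variables (one column per variable).
    corrmat = np.corrcoef(data, rowvar=False)

    # eigh is meant for symmetric matrices: real eigenvalues, ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(corrmat)

    # Re-sort to descending eigenvalue order.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Sign convention as above: flip each component so that it matches SPSS.
    eigenvectors = -eigenvectors * np.sign(np.sum(eigenvectors, axis=0))

    # Loadings: eigenvectors scaled by the square roots of their eigenvalues.
    return eigenvectors * np.sqrt(eigenvalues)

data = np.genfromtxt('input.csv', delimiter=';', usecols=range(0, 6))
print(spss_style_loadings(data)[:, :3])  # first three factors, as in SPSS

On the question's data this should reproduce the first three columns of the SPSS table above, up to floating-point rounding.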

The paper that helped me understand this is Yoel Haitovsky (1966).

  • I'm going to read the paper to understand what you have done. Can you please take a look at the 2nd column? The sign is the opposite of what I expected. – engel May 10 '17 at 15:41
  • @engel Did you find the same results as the ones reported by SPSS? – keepAlive May 10 '17 at 15:48
  • Yes, the results are exactly equal to yours, except for the 2nd column. I'm going to try different datasets to see what happens. – engel May 10 '17 at 16:00
  • @engel Ok ***solved***. The problem really was that **only** one component was flipped. Note the part `np.sign(np.sum(_eigenvectors, 0))`: by doing so I follow some implementations that change the sign of a factor so that its positive values dominate in sum. – keepAlive May 10 '17 at 19:00
  • When I use the eig and eigvals functions, the result is sometimes complex; eigh and eigvalsh can solve this problem (see the sketch below). – xingpei Pang Jul 03 '19 at 08:04
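To illustrate that last comment: np.linalg.eig treats its input as a general matrix, so floating-point asymmetry can produce eigenvalues with tiny imaginary parts, whereas np.linalg.eigh assumes a symmetric (Hermitian) matrix and always returns real eigenvalues. A minimal sketch with synthetic data:

import numpy as np

rng = np.random.default_rng(0)
corrmat = np.corrcoef(rng.standard_normal((36, 6)), rowvar=False)

# eig: general-matrix routine; dtype can come back complex in edge cases.
print(np.linalg.eig(corrmat)[0].dtype)   # float64 here, complex128 in edge cases

# eigh: symmetric routine; eigenvalues are always real (and sorted ascending).
print(np.linalg.eigh(corrmat)[0].dtype)  # float64, always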