I am working on a personal project where I want to use Statsmodels' PCA on a dataset. The ultimate goal is to then perform a linear regression and evaluate its predictions. I know scikit-learn may be the preferred library, but I wanted to use Statsmodels because it provides more statistical information: I want to actually do t-tests on the coefficients (not F-tests, as in scikit-learn's f_regression) to verify their significance, rather than relying on R-squared alone.
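
To make the end goal concrete, the regression step I have in mind would look roughly like this. It is only a sketch, and it assumes the pca.scores and y_train produced by the steps further down:

import statsmodels.api as sm

# Regress the target on the principal-component scores (plus an intercept)
ols_results = sm.OLS(y_train, sm.add_constant(pca.scores)).fit()

print(ols_results.summary())   # per-coefficient t-statistics and p-values
print(ols_results.pvalues)     # t-test p-values for each component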

I first used scikit-learn's PCA, which was very intuitive to use, but Statsmodels' PCA doesn't have a method to transform test data using the previously found eigenvectors. I attempted to implement this myself, and I'm discovering I have a gap in my understanding of how this is supposed to work. These are the steps I'm using:

First I pre-standardize the data using a scikit-learn method, so that I can split it into training and test subsets. Then I apply the PCA method to the training dataset:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from statsmodels.multivariate.pca import PCA
import numpy as np
import pandas as pd

# Standardize the full dataset, then split off a 20% test set
scaler = StandardScaler()
scaler.fit(X)
X = pd.DataFrame(scaler.transform(X), columns=X.columns.values.tolist())
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

# Run statsmodels' PCA on the already-standardized training data
pca = PCA(X_train, standardize=False)

Now, according to my previous understanding, and also to how I read the responses to this related question, you should be able to apply the results of the PCA to the test dataset by multiplying the data by the eigenvector matrix. I wanted to verify this, so I first tried applying it to the training data:

# Project the standardized training data onto the eigenvectors
X_train_new = pd.DataFrame(np.dot(X_train, pca.eigenvecs), index=X_train.index).add_prefix('comp_')

However, the result I get from the above is completely different from the scores returned by PCA:

In [118]: X_train_new.head()

Out[118]: 
        comp_0    comp_1    comp_2    comp_3    comp_4    comp_5    comp_6
300  -0.487775 -0.808892 -0.017517 -0.457110  0.156338 -0.462428  0.104817   
268   0.884563 -1.017728  0.856010 -0.490574 -0.578573  0.073373  0.117204   
171   1.026855  0.749209 -1.249890  0.700690 -1.597226  0.723078  0.608291   
233   0.481844 -0.526096 -0.828919 -0.911208  1.213542  0.263043  0.348731   
48   10.115987 -4.366090 -3.692285  0.988589 -2.519107 -0.764554  0.197523   

   comp_7    comp_8    comp_9  
300  0.264924 -0.012484 -0.026111  
268  0.006898  0.209725 -0.018939  
171  0.643990 -0.059137 -0.017679  
233  0.021720 -0.032748 -0.035821  
48  -0.190642 -0.378972 -0.356064  

In [120]: pca.scores.head()

Out[120]: 
       comp_0    comp_1    comp_2    comp_3    comp_4    comp_5    comp_6
300 -0.015516 -0.028511 -0.001990 -0.025318  0.010894 -0.039532  0.008800   
268  0.023603 -0.035698  0.035028 -0.027374 -0.037984  0.005089  0.010038   
171  0.027659  0.025113 -0.054214  0.045837 -0.105733  0.059196  0.059115   
233  0.012123 -0.018778 -0.036375 -0.053225  0.081207  0.020885  0.033176   
48   0.286748 -0.150936 -0.157716  0.063531 -0.167046 -0.064693  0.018065   

   comp_7    comp_8    comp_9  
300  0.042726 -0.004602 -0.030341  
268 -0.000906  0.041061 -0.021901  
171  0.106825 -0.014189 -0.020419  
233  0.001600 -0.008767 -0.041768  
48  -0.034310 -0.079915 -0.418612  

For reference, below are the original data values and eigenvectors. What am I missing here? How do I do this properly?

In [121]: X_train.head()

Out[121]: 
      enrltot  teachers   calwpct   mealpct  computer   compstu   expnstu
300 -0.015298  0.047593 -0.678221 -0.736705  0.051306 -0.139168  0.138337   
268  0.789370  0.791750 -0.237016 -0.540257  0.421076 -0.776026 -0.522788   
171  0.345971  0.431206  2.493394  0.260812  0.686493  0.251184  0.072822   
233  0.227509  0.315057 -0.303749  0.174288  0.262279 -0.259329  0.521848   
48   6.280558  6.925998  2.270078  1.461358  6.852336 -0.209820  0.871722   

   str    avginc     elpct  
300 -0.542181  0.284436 -0.023623  
268  0.496390  0.375899 -0.389026  
171 -0.361609 -0.355736 -0.379221  
233 -0.501464  0.991566  1.332658  
48  -0.329670 -0.444414  0.717009  

In [122]: pca.eigenvecs

Out[122]: 
   eigenvec_0  eigenvec_1  eigenvec_2  eigenvec_3  eigenvec_4  eigenvec_5
0    0.472212   -0.298852   -0.069933    0.004498   -0.108498   -0.086156   
1    0.468842   -0.305645   -0.093485   -0.002551   -0.114603   -0.099414   
2    0.206918    0.429286   -0.325422    0.115116   -0.463520    0.496344   
3    0.279964    0.463414   -0.205124    0.074066    0.141786    0.029554   
4    0.419081   -0.337337   -0.119961    0.155727   -0.089232   -0.054511   
5   -0.209166   -0.175605   -0.271319    0.874289    0.250622    0.060924   
6   -0.142078   -0.103631   -0.628986   -0.240868    0.000895    0.139945   
7    0.241967    0.038432    0.572694    0.196042    0.078460    0.538291   
8   -0.169597   -0.469507   -0.082240   -0.263187    0.178807    0.646749   
9    0.330350    0.202343   -0.141508   -0.163969    0.794505    0.033980   

   eigenvec_6  eigenvec_7  eigenvec_8  eigenvec_9  
0   -0.046910   -0.077735    0.438922    0.680223  
1    0.000772   -0.062815    0.344505   -0.729640  
2    0.242404    0.357106    0.078833    0.006336  
3    0.040837   -0.789782   -0.113606   -0.000106  
4    0.062250    0.096983   -0.802598    0.059629  
5    0.014266    0.002641    0.143517   -0.009252  
6   -0.702480    0.005506   -0.047467   -0.009290  
7   -0.525088   -0.011045   -0.027148   -0.032770  
8    0.393957   -0.263067    0.005613    0.008665  
9    0.094312    0.399953    0.041957    0.002272  

1 Answer

I just figured out what I was doing wrong here. Since I had already standardized the data before running PCA, there were additional arguments that I needed to set to False:

pca = PCA(X_train, standardize=False, demean=False, normalize=False)

This changed the numbers slightly, but now the results match up, as shown below. With demean=False, PCA no longer re-centers the already-standardized training data, and with normalize=False the scores are left on the same scale as the data projected onto the eigenvectors, instead of being rescaled to have unit inner product. It goes to show how important it is to read the documentation thoroughly; I missed these parameters earlier!

In[139]: X_train_new = pd.DataFrame(np.dot(X_train,pca.eigenvecs),index=X_train.index).add_prefix('comp_')

In[140]: X_train_new.head()

Out[140]: 
        comp_0    comp_1    comp_2    comp_3    comp_4    comp_5    comp_6
300  -0.488200 -0.808226 -0.015942 -0.457189  0.157613 -0.462587  0.105420   
268   0.884898 -1.017663  0.856941 -0.491083 -0.577022  0.072337  0.115896   
171   1.026580  0.747384 -1.251388  0.692950 -1.599866  0.724840  0.608394   
233   0.481443 -0.526664 -0.827639 -0.909763  1.215789  0.262839  0.347564   
48   10.107996 -4.378685 -3.698964  0.983697 -2.521392 -0.764460  0.198071   

   comp_7    comp_8    comp_9  
300  0.264816  0.013475 -0.026100  
268  0.006735 -0.209005 -0.018908  
171  0.643124  0.061879 -0.017646  
233  0.020724  0.033462 -0.035804  
48  -0.192125  0.378399 -0.356084  

In[138]: pca.scores.head()

Out[138]: 
        comp_0    comp_1    comp_2    comp_3    comp_4    comp_5    comp_6
300  -0.488200 -0.808226 -0.015942 -0.457189  0.157613 -0.462587  0.105420   
268   0.884898 -1.017663  0.856941 -0.491083 -0.577022  0.072337  0.115896   
171   1.026580  0.747384 -1.251388  0.692950 -1.599866  0.724840  0.608394   
233   0.481443 -0.526664 -0.827639 -0.909763  1.215789  0.262839  0.347564   
48   10.107996 -4.378685 -3.698964  0.983697 -2.521392 -0.764460  0.198071   

   comp_7    comp_8    comp_9  
300  0.264816  0.013475 -0.026100  
268  0.006735 -0.209005 -0.018908  
171  0.643124  0.061879 -0.017646  
233  0.020724  0.033462 -0.035804  
48  -0.192125  0.378399 -0.356084 
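
With those parameters set, the test data can be transformed with the same multiplication, since the eigenvectors were estimated from the training data only. A minimal sketch, reusing the variables from above:

# Project the held-out test rows onto the eigenvectors found on the training data
X_test_new = pd.DataFrame(np.dot(X_test, pca.eigenvecs), index=X_test.index).add_prefix('comp_')

X_test_new then lines up column-for-column with pca.scores, so it can be fed into whatever regression is fit on the training components.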