
I want to apply PCA to Kaggle's Titanic dataset.

For now I'm only using the columns that have numeric values and dropping the rows with NaN values, so I have five variables, or four if we ignore the dependent variable ('Survived').


I have this loaded into a DataFrame df. If I take five components using PCA:

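from sklearn.decomposition import PCA

# Fit PCA on all five numeric columns and inspect the variance share per component
pca_model = PCA(n_components=5)
pca_model.fit(df)
pca_model.explained_variance_ratio_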

[  9.30197643e-01   6.93699966e-02   2.24377672e-04   1.49076254e-04
   5.89069784e-05]

So 93 percent of the variance comes from the first component. How can I get the same kind of values in terms of the original variables? E.g. Age -> 0.3 of the variance, Fare -> 0.6.

Can I know what percentage of each principal component is contributed by each of the original variables?

  • What you may be speaking of is called PCA _loadings_. (Please search this site: `PCA loadings`.) A loading is the covariance or correlation between the unit-standardized component and a variable having its variance. Therefore a squared loading is the amount of the variance in a variable accounted for by the component. The variance of a component (its eigenvalue) is the sum of its squared loadings. – ttnphns Mar 02 '17 at 21:27
  • Read e.g. this: http://stats.stackexchange.com/q/143905/3277 – ttnphns Mar 02 '17 at 21:39
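
The loadings described in the comments can be sketched with scikit-learn roughly as follows. This is only an illustration under assumptions: the DataFrame here is a hypothetical stand-in filled with random numbers (the column names are guessed from the question, the values are not the real Titanic data), and the names `weights_sq`, `loadings_sq` and `var_share` are made up for this example. It assumes PCA on the raw (unstandardized) columns, as in the question, so the loadings are covariance-based.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical stand-in for the numeric Titanic columns: random values,
# NOT the real dataset. Column names are assumed from the question.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Survived": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
    "Age": rng.normal(30, 14, n),
    "SibSp": rng.integers(0, 5, n),
    "Fare": rng.exponential(30, n),
})

pca_model = PCA(n_components=5).fit(df)
pcs = [f"PC{i + 1}" for i in range(pca_model.n_components_)]

# Each row of components_ is a unit-length eigenvector, so the squared
# weights of one component sum to 1 across the original variables.
weights_sq = pd.DataFrame(pca_model.components_.T ** 2,
                          index=df.columns, columns=pcs)

# Covariance-based loadings: eigenvector scaled by sqrt(eigenvalue).
loadings = pca_model.components_.T * np.sqrt(pca_model.explained_variance_)
loadings_sq = pd.DataFrame(loadings ** 2, index=df.columns, columns=pcs)

# Fraction of each variable's variance accounted for by each component.
var_share = loadings_sq.div(df.var(ddof=1), axis=0)

print(weights_sq.round(3))
print(var_share.round(3))
```

Each column of `weights_sq` sums to 1, which is one way to read "how much of this component comes from each variable". Each column of `loadings_sq` sums to that component's eigenvalue, and each row of `var_share` sums to 1 when all components are kept. If the columns are on very different scales (Fare versus Pclass, say), standardizing them before fitting would change all of these numbers.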

0 Answers