
I am trying to reduce high-dimensional data using FactoMineR. I created training and test data sets and ran a PCA.

```r
library(FactoMineR)

bound <- floor((nrow(GK2) / 4) * 3)        # size of the training set (75%)

GK2 <- GK2[sample(nrow(GK2)), ]            # shuffle the rows
GK2.train <- GK2[1:bound, ]                # training set
GK2.test  <- GK2[(bound + 1):nrow(GK2), ]  # test set
GKPCA <- PCA(GK2.train)
```


```
GKPCA$eig
          eigenvalue percentage of variance cumulative percentage of variance
comp 1  52.259733827           57.428278931                          57.42828
comp 2   7.152528027            7.859920909                          65.28820
comp 3   5.099126890            5.603436143                          70.89164
comp 4   4.064143884            4.466092181                          75.35773
comp 5   3.600750943            3.956869169                          79.31460
comp 6   3.138452260            3.448848637                          82.76345
comp 7   2.894380868            3.180638316                          85.94408
comp 8   2.287930806            2.514209677                          88.45829
comp 9   1.971793852            2.166806431                          90.62510
comp 10  1.572952777            1.728519535                          92.35362
comp 11  1.435328251            1.577283792                          93.93090
comp 12  1.173586263            1.289655234                          95.22056
comp 13  1.066540121            1.172022111                          96.39258
comp 14  0.727976218            0.799973866                          97.19255
comp 15  0.580969070            0.638427550                          97.83098
comp 16  0.554111316            0.608913534                          98.43990
```
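In case it is useful, here is a sketch of how the eigenvalue table above can be used to pick a number of components programmatically (the 80% threshold and the eigenvalue-greater-than-the-mean rule are common conventions, not requirements; `GKPCA` is the object from the code above):

```r
eig <- GKPCA$eig

# First component at which cumulative explained variance reaches 80%
which(eig[, "cumulative percentage of variance"] >= 80)[1]

# Kaiser-style rule: keep components whose eigenvalue exceeds the mean eigenvalue
sum(eig[, "eigenvalue"] > mean(eig[, "eigenvalue"]))

# Scree plot of the eigenvalues to eyeball the "elbow"
barplot(eig[, "eigenvalue"], names.arg = rownames(eig), las = 2)
```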

From my understanding, the first 5 or 6 components are the ones that "matter" most. I know this is a silly question, but how do I get the names of these components (comp 1, comp 2, etc.)? If my understanding is wrong, please let me know. Alternatively, any other suggestions on how to reduce the data (i.e. find the most 'important' variables), which I will then use for a cluster analysis, would be welcome.
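For finding which of the original variables drive each component, a minimal sketch (assuming the `GKPCA` object from the code above): FactoMineR stores per-variable contributions in `GKPCA$var$contrib`, and `dimdesc()` describes each dimension by the variables most correlated with it:

```r
library(FactoMineR)

# Contribution (in %) of each original variable to each component;
# rows are variables, columns are components
head(GKPCA$var$contrib)

# Top 10 contributors to the first component
sort(GKPCA$var$contrib[, 1], decreasing = TRUE)[1:10]

# Variables most significantly correlated with the first 5 dimensions
desc <- dimdesc(GKPCA, axes = 1:5)
desc$Dim.1
```

The components themselves have no names beyond "Dim 1", "Dim 2", …; the output above tells you which named input variables each one is built from.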

Some notes on the data:

  • It includes many 0s, which produce Inf values when the PCA is run.
  • The data set is goalkeeper (player) performance (it was match by match, but I aggregated it to make more sense).
  • There are many variables compared to observations (40 observations, 196 variables).
  • GKPCA$loadings is NULL.
  • I couldn't use the princomp function.

  • Those variables do not actually have names, they are optimally rotated combinations of your initial variables. – Mike Wise Dec 26 '15 at 14:07
  • That helps, but I am still not sure how I would reduce the dimensions, which would then help me do a cluster analysis. Sorry for my stupidity, maybe I'm just not getting it. –  Dec 26 '15 at 14:28
  • You transform your data to the new coordinates (basically a rotation onto the PCA vectors). Then the first coordinate is the most important new variable (but without a name, or an easy interpretation), the second is the next one, and so on. Maybe you have 5 now. Then you use those five and do some kind of clustering or plotting. – Mike Wise Dec 26 '15 at 14:34
  • If you want to retain original variables and do cluster analysis in a reduced number of dimensions, you can always... not use all the variables. :) – Roman Luštrik Dec 26 '15 at 16:04
  • I am trying to figure out which variables are important. As of now I have 200 variables, and I am trying to break it down to see which are relevant (and then do a cluster analysis). –  Dec 26 '15 at 20:04
  • here is how you can do it with the R-function prcomp: http://stats.stackexchange.com/questions/169440/determining-pca-scores-for-a-new-data-point/169482#169482 –  Dec 28 '15 at 14:35
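Following the suggestion in the comments to cluster on the transformed coordinates, a minimal sketch (assuming the `GKPCA`, `GK2.train`, and `GK2.test` objects from the question; keeping 5 components and using 3 clusters are illustrative choices, not recommendations): FactoMineR stores the individuals' coordinates in `GKPCA$ind$coord`, and `predict()` projects new data onto the same components:

```r
library(FactoMineR)

# Coordinates of the training observations on the principal components
train.coords <- GKPCA$ind$coord[, 1:5]   # keep the first 5 components

# k-means on the reduced coordinates (3 clusters chosen only as an example)
set.seed(42)
km <- kmeans(train.coords, centers = 3, nstart = 25)
km$cluster                                # cluster assignment per goalkeeper

# Project the held-out test set onto the same components
test.coords <- predict(GKPCA, newdata = GK2.test)$coord[, 1:5]
```

FactoMineR also provides `HCPC()`, which runs a hierarchical clustering directly on a PCA result, if you prefer not to hand-roll the k-means step.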

0 Answers