2

I have been using the principal() function of the psych package in R and setting the number of components after a scree plot analysis (fa.parallel).

But I don't understand why the number of components are important to the principal() function. In any case the principal component of the data should not change with the number of component you setup, or am I mistaken? In my mind the subspace and explained variance should not change...

First component should be the one explaining the most variance, the second with the most residual variance orthogonal to the first component etc.. no ?

In contrast, princomp for example does not ask for the number of components... it makes more sense to me.

set.seed(23434)
X = seq(-5,10,0.2)
X = X + (rnorm(n = NROW(X),mean = 0, sd = 2))
Z = sample(x = seq(-4,12,0.1), size = NROW(X))+(rnorm(n = NROW(X),mean = 0, sd = 2))

Y = 2*X+Z+(rnorm(n = NROW(X),mean = 0, sd = 2))+3

# center the data
Ycent = Y - mean(Y)
Xcent = X - mean(X)
Zcent = Z - mean(Z)

# scree plot to determine the number of components
fa.parallel(data.frame(X=Xcent, Y=Ycent,Z=Zcent))

# compare different PCA methods
principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 2)
principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 3)
a =princomp(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
loadings(a)

# Results
> fa.parallel(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
Parallel analysis suggests that the number of factors =  2  and the number of components =  1 
> 
> # compare different PCA methods
> principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
Principal Components Analysis
Call: principal(r = data.frame(X = Xcent, Y = Ycent, Z = Zcent))
Standardized loadings (pattern matrix) based upon correlation matrix
   PC1   h2    u2 com
X 0.89 0.79 0.213   1
Y 1.00 0.99 0.006   1
Z 0.46 0.21 0.792   1

                PC1
SS loadings    1.99
Proportion Var 0.66

Mean item complexity =  1
Test of the hypothesis that 1 component is sufficient.

The root mean square of the residuals (RMSR) is  0.23 
 with the empirical chi square  24.77  with prob <  NA 

Fit based upon off diagonal values = 0.83
> principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 2)
Principal Components Analysis
Call: principal(r = data.frame(X = Xcent, Y = Ycent, Z = Zcent), nfactors = 2)
Standardized loadings (pattern matrix) based upon correlation matrix
   PC1   PC2   h2     u2 com
X 0.99 -0.09 1.00 0.0047 1.0
Y 0.92  0.38 0.99 0.0060 1.3
Z 0.09  1.00 1.00 0.0012 1.0

                       PC1  PC2
SS loadings           1.85 1.14
Proportion Var        0.62 0.38
Cumulative Var        0.62 1.00
Proportion Explained  0.62 0.38
Cumulative Proportion 0.62 1.00

Mean item complexity =  1.1
Test of the hypothesis that 2 components are sufficient.

The root mean square of the residuals (RMSR) is  0 
 with the empirical chi square  0.01  with prob <  NA 

Fit based upon off diagonal values = 1
> principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 3)
Principal Components Analysis
Call: principal(r = data.frame(X = Xcent, Y = Ycent, Z = Zcent), nfactors = 3)
Standardized loadings (pattern matrix) based upon correlation matrix
   PC1   PC2   PC3 h2       u2 com
X 0.99 -0.08 -0.06  1  4.4e-16 1.0
Y 0.92  0.37  0.11  1  5.6e-16 1.4
Z 0.08  1.00  0.02  1 -1.6e-15 1.0

                       PC1  PC2  PC3
SS loadings           1.84 1.14 0.02
Proportion Var        0.61 0.38 0.01
Cumulative Var        0.61 0.99 1.00
Proportion Explained  0.61 0.38 0.01
Cumulative Proportion 0.61 0.99 1.00

Mean item complexity =  1.1
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0 
 with the empirical chi square  0  with prob <  NA 

Fit based upon off diagonal values = 1
> a =princomp(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
> loadings(a)

Loadings:
  Comp.1 Comp.2 Comp.3
X -0.361 -0.472  0.804
Y -0.917        -0.398
Z -0.170  0.881  0.441

               Comp.1 Comp.2 Comp.3
SS loadings     1.000  1.000  1.000
Proportion Var  0.333  0.333  0.333
Cumulative Var  0.333  0.667  1.000

As a side question, I do not understand why princomp explained variance is 0.33 for all components either... Shouldn't the first one explain the most variance ?

  • 1
    That's because `principal` uses varimax rotation by default (and `princomp` does not). – amoeba May 22 '16 at 22:22
  • @amoeba Thank you ! So a rotation with a lower number of component than the total number of eigenvector does something close to FA ? Would an orthogonal rotation make sense with the total number of component ? – Albert James Teddy May 23 '16 at 05:23

1 Answers1

6

There are several (at least two) potentially confusing issues cropping up here.


Confusing issue #1: psych::principal() uses varimax rotation by default

The principal function is part of the psych package that is mostly focusing on doing factor analysis. Even though the principal function is doing PCA, it is following the FA-style approach. In particular, it applies varimax rotation by default (which I find rather confusing), meaning that the components that you get out of it are not principal components. They are "rotated principal components".

You can actually see it directly in the output where they are appropriately called RC1, RC2, etc., instead of PC1, PC2, etc.

Importantly, the result of varimax applied to $k$ PCs is not the same as the result of varimax applied to $k+1$ PCs. See Is PCA followed by a rotation (such as varimax) still PCA? for a lot of details on varimax rotation applied to PCA.

If you set rotate="none" input parameter, then no rotation will be performed, you will see PCs and not RCs in the output, and the components will not depend on how many of them are requested.


Confusing issue #2: the output of loadings(prcomp()) does not make sense because these are not loadings.

What princomp returns as loadings are just eigenvectors of the covariance matrix (scaled to unit length, as is standard). This is not what is known as "loadings" in factor analysis: the latter are eigenvectors scaled by the square roots of the corresponding eigenvalues. (See e.g. Loadings vs eigenvectors in PCA: when to use one or another? and How does "Fundamental Theorem of Factor Analysis" apply to PCA, or how are PCA loadings defined?)

The function loadings() probably assumes that it gets FA loadings as an input and consequently takes their sums of squares as "explained variance". When applied to PCA unit-scaled eigenvectors, this will result in identical explained variance of each component. That's why you get equal numbers in Proportion Var output. This is ridiculously misleading.

To get actual explained variances, instead of loadings(a) run summary(a). See here: What is the difference between summary() and loadings() for princomp() object in R?

amoeba
  • 93,463
  • 28
  • 275
  • 317
  • 1
    `which I find rather confusing`. To add, another confusing point is that the documentation says nothing about whether Kaiser normalization is done / not done / possible at all - with their varimax subcommand. – ttnphns May 24 '16 at 14:55
  • @ttnphns: Oh yes. Precisely. See the post scriptum that I added to [my answer here](http://stats.stackexchange.com/a/137003/28666) just a couple of days ago. It took GottfriedHelms and me a dozen of comments (now erased) to figure out what was causing that SPSS/R inconsistency. – amoeba May 24 '16 at 16:10