I have been using the principal()
function of the psych package in R and setting the number of components after a scree plot analysis (fa.parallel
).
But I don't understand why the number of components are important to the principal()
function. In any case the principal component of the data should not change with the number of component you setup, or am I mistaken?
In my mind the subspace and explained variance should not change...
First component should be the one explaining the most variance, the second with the most residual variance orthogonal to the first component etc.. no ?
In contrast, princomp
for example does not ask for the number of components... it makes more sense to me.
set.seed(23434)
X = seq(-5,10,0.2)
X = X + (rnorm(n = NROW(X),mean = 0, sd = 2))
Z = sample(x = seq(-4,12,0.1), size = NROW(X))+(rnorm(n = NROW(X),mean = 0, sd = 2))
Y = 2*X+Z+(rnorm(n = NROW(X),mean = 0, sd = 2))+3
# center the data
Ycent = Y - mean(Y)
Xcent = X - mean(X)
Zcent = Z - mean(Z)
# scree plot to determine the number of components
fa.parallel(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
# compare different PCA methods
principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 2)
principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 3)
a =princomp(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
loadings(a)
# Results
> fa.parallel(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
Parallel analysis suggests that the number of factors = 2 and the number of components = 1
>
> # compare different PCA methods
> principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
Principal Components Analysis
Call: principal(r = data.frame(X = Xcent, Y = Ycent, Z = Zcent))
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 h2 u2 com
X 0.89 0.79 0.213 1
Y 1.00 0.99 0.006 1
Z 0.46 0.21 0.792 1
PC1
SS loadings 1.99
Proportion Var 0.66
Mean item complexity = 1
Test of the hypothesis that 1 component is sufficient.
The root mean square of the residuals (RMSR) is 0.23
with the empirical chi square 24.77 with prob < NA
Fit based upon off diagonal values = 0.83
> principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 2)
Principal Components Analysis
Call: principal(r = data.frame(X = Xcent, Y = Ycent, Z = Zcent), nfactors = 2)
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 h2 u2 com
X 0.99 -0.09 1.00 0.0047 1.0
Y 0.92 0.38 0.99 0.0060 1.3
Z 0.09 1.00 1.00 0.0012 1.0
PC1 PC2
SS loadings 1.85 1.14
Proportion Var 0.62 0.38
Cumulative Var 0.62 1.00
Proportion Explained 0.62 0.38
Cumulative Proportion 0.62 1.00
Mean item complexity = 1.1
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0
with the empirical chi square 0.01 with prob < NA
Fit based upon off diagonal values = 1
> principal(data.frame(X=Xcent, Y=Ycent,Z=Zcent), nfactors = 3)
Principal Components Analysis
Call: principal(r = data.frame(X = Xcent, Y = Ycent, Z = Zcent), nfactors = 3)
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 h2 u2 com
X 0.99 -0.08 -0.06 1 4.4e-16 1.0
Y 0.92 0.37 0.11 1 5.6e-16 1.4
Z 0.08 1.00 0.02 1 -1.6e-15 1.0
PC1 PC2 PC3
SS loadings 1.84 1.14 0.02
Proportion Var 0.61 0.38 0.01
Cumulative Var 0.61 0.99 1.00
Proportion Explained 0.61 0.38 0.01
Cumulative Proportion 0.61 0.99 1.00
Mean item complexity = 1.1
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0
with the empirical chi square 0 with prob < NA
Fit based upon off diagonal values = 1
> a =princomp(data.frame(X=Xcent, Y=Ycent,Z=Zcent))
> loadings(a)
Loadings:
Comp.1 Comp.2 Comp.3
X -0.361 -0.472 0.804
Y -0.917 -0.398
Z -0.170 0.881 0.441
Comp.1 Comp.2 Comp.3
SS loadings 1.000 1.000 1.000
Proportion Var 0.333 0.333 0.333
Cumulative Var 0.333 0.667 1.000
As a side question, I do not understand why princomp
explained variance is 0.33 for all components either... Shouldn't the first one explain the most variance ?