
My understanding of PCA is that its main purpose is dimensionality reduction: a smaller set of PCs can explain the majority of the variance in the original variables.

As an example, in my interpretation, if there are 5 independent variables, then with no other information each would be expected to explain ~20% of the variance. If after PCA the components do not each explain ~20% (say the first 2 PCs explain 90% of the variance in the data), then PCA clearly reduces dimensionality. However, I have a data set where each PC explains something much closer to ~20%:

        eigenvalue percentage of variance cumulative percentage of variance
Pcomp 1  1.3762857               27.60763                          27.60763
Pcomp 2  1.1718536               23.50682                          51.11446
Pcomp 3  0.9139234               18.33287                          69.44733
Pcomp 4  0.8245694               16.54047                          85.98780
Pcomp 5  0.6985312               14.01220                         100.00000
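For reference, a table like the one above can be produced in base R with `prcomp()`; the data and variable names below are made up for illustration, since the original data isn't shown.

```r
## Sketch: eigenvalues and (cumulative) percentage of variance from a PCA.
## Assumption: PCA on standardized variables (scale. = TRUE), hypothetical data.
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)        # hypothetical data: 100 obs, 5 variables
colnames(X) <- paste0("Var", 1:5)

pca <- prcomp(X, scale. = TRUE)              # PCA on the correlation matrix
eigenvalue <- pca$sdev^2                     # eigenvalues of the correlation matrix
pct <- 100 * eigenvalue / sum(eigenvalue)    # percentage of variance per PC
cum_pct <- cumsum(pct)                       # cumulative percentage of variance
round(data.frame(eigenvalue, pct, cum_pct), 5)
```

With 5 standardized variables the eigenvalues always sum to 5, so the percentages always sum to 100; the question is only how unevenly that total is split across the components.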

At what point would you accept the PCs don't sufficiently reduce dimensionality and just stick with the original variables? And is there a test to support this decision?

Intuitively a chi-square goodness-of-fit test would make sense to me (n=5), comparing observed vs expected outcomes (E = 20%), but I think using percentages rather than counts breaks the assumptions of the chi-square test?

Is there a simpler way to go about this or an appropriate test to apply?

      Var1  Var2  Var3  Var4  Var5 
Var1  1.00  0.29 -0.11 -0.03 -0.07 
Var2  0.29  1.00 -0.14 -0.03  0.00 
Var3 -0.11 -0.14  1.00 -0.01 -0.06
Var4 -0.03 -0.03 -0.01  1.00  0.16 
Var5 -0.07  0.00 -0.06  0.16  1.00
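Bartlett's test of sphericity (suggested in the comments below) tests whether a correlation matrix differs from the identity, and can be computed directly from the matrix above. Note it needs the sample size n, which isn't stated in the question, so the value below is a placeholder.

```r
## Sketch: Bartlett's test of sphericity from a correlation matrix.
## Assumption: n = 120 is a placeholder; the actual sample size isn't given.
R <- matrix(c( 1.00,  0.29, -0.11, -0.03, -0.07,
               0.29,  1.00, -0.14, -0.03,  0.00,
              -0.11, -0.14,  1.00, -0.01, -0.06,
              -0.03, -0.03, -0.01,  1.00,  0.16,
              -0.07,  0.00, -0.06,  0.16,  1.00), nrow = 5, byrow = TRUE)
p <- ncol(R)
n <- 120                                          # placeholder sample size
X2 <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))    # test statistic
df <- p * (p - 1) / 2                             # degrees of freedom
pval <- pchisq(X2, df, lower.tail = FALSE)
c(X2 = X2, df = df, p.value = pval)
```

The statistic grows with n for a fixed correlation matrix, which is why even small correlations can give a tiny p-value when the sample is large.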

          PC1   PC2   PC3   PC4   PC5 
Var1     0.63 -0.05 -0.35 -0.03 -0.69 
Var2     0.64  0.09 -0.19 -0.26  0.69 
Var3    -0.40 -0.33 -0.72 -0.46  0.04 
Var4    -0.13  0.63 -0.54  0.53  0.09 
Var5    -0.10  0.70  0.15 -0.66 -0.20
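Since the PCA was run on the correlation matrix, `eigen()` applied to the (rounded) matrix printed above should approximately reproduce these eigenvalues and loadings; small discrepancies are expected because the matrix is rounded to two decimals.

```r
## Sketch: recover eigenvalues/loadings from the printed correlation matrix.
R <- matrix(c( 1.00,  0.29, -0.11, -0.03, -0.07,
               0.29,  1.00, -0.14, -0.03,  0.00,
              -0.11, -0.14,  1.00, -0.01, -0.06,
              -0.03, -0.03, -0.01,  1.00,  0.16,
              -0.07,  0.00, -0.06,  0.16,  1.00), nrow = 5, byrow = TRUE)
e <- eigen(R)
round(e$values, 3)   # close to the eigenvalues above, not exact (matrix is rounded)
round(e$vectors, 2)  # columns are the loadings, up to sign
sum(e$values)        # equals the trace: 5 for 5 standardized variables
```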
Roasty247
    Possible duplicate of [Why does sphericity diagnosed by Bartlett's Test mean a PCA is inappropriate?](https://stats.stackexchange.com/questions/92791/why-does-sphericity-diagnosed-by-bartletts-test-mean-a-pca-is-inappropriate) – ttnphns Jul 09 '19 at 07:06
  • 4
    My general advice would be to disregard PCA in these cases, as you will lose interpretability for very little gain. However, a sphericity test as suggested by @ttnphns can be of great help. A quick look at the covariance matrix will also give you some hints – David Jul 09 '19 at 07:22
  • 5
    Depends on the purpose of the study too. If e.g. PC1 and PC2 can be given simple interpretations, then that is a bonus. Often that doesn't work well; sometimes it's not even important as a goal. If there is some kind of theory suggesting latent variables, then often that is better tested with structural equation models. I wouldn't ever do a formal test: a glance at the eigenvalues and seeing if the loadings suggest an interpretation is enough for me, but tastes differ. Calculating the correlations between the PCs and the original variables is a simple step often omitted but also helpful. – Nick Cox Jul 09 '19 at 07:43
  • 4
    Correct on chi-square: percentage shares don't match what the usual test does, which is to compare observed and expected frequencies (counts). That test makes no sense here. – Nick Cox Jul 09 '19 at 07:52
  • So looking at the correlation matrix from these data (added to the OP), I don't think it makes sense to use PCs, as interpretability is lost for little gain, as @David suggests. I did the Bartlett sphericity test anyway, as suggested by @ttnphns, using `bart_spher()` from the `REdaS` package in R. I find it interesting to get such a low p-value when the correlations between variables are so close to zero (one is exactly zero). Thanks all for the suggestions. `Bartlett's Test of Sphericity Call: bart_spher(x = .) X2 = 49.639 df = 10 p-value < 2.22e-16` – Roasty247 Jul 09 '19 at 09:24
  • What is the goal of your analysis? Without knowing that this is difficult to answer. If it is regression modelling/prediction, maybe look into regularization as an alternative – kjetil b halvorsen Feb 21 '22 at 02:03

0 Answers