I'm completing scientific analysis of chemical compounds in consumer products. As a non-statistician, I would really appreciate any thoughts from the experts here.

My data are non-normal, so I've used non-parametric tests such as Mann-Whitney (MW) and Kruskal-Wallis (KW) for hypothesis testing between samples so far. However, I now have to conduct a principal component analysis (PCA) of the different compounds measured in the different products (measured in different units).

The stats add-in I was using asks for the data format to be specified (e.g., an observation/variable table versus a correlation or covariance matrix). I'm working with raw data, so I used the observation/variable table set-up.

But it also asks me to specify the PCA type from the following options: Pearson (n), Pearson (n-1), Spearman, Kendall, Covariance, and so on. I tested the same data set with the Pearson (n) option and the Spearman option and got very different eigenvalues and eigenvectors. The final biplot is naturally quite different.

Any help regarding what the difference is, and which PCA type should be used, would be greatly appreciated.

UPDATE: I was using XLSTAT (an Excel add-in). Is it okay to use Pearson as the "PCA type" when the correlations between the variables are non-linear? This "PCA type" option does not appear in other stats programs (e.g., SPSS), so a novice SPSS user would by default be using the Pearson "PCA type".

Oleic
  • 1
    This sounds very odd in implying that Pearson (n) and Pearson (n-1) are different. There aren't different formulas or ideas for Pearson correlation. There are different formulas for standard deviation, but it's immaterial which you use for Pearson correlation, as identical terms cancel in numerator and denominator. This may sound like a minute detail, but it could be diagnostic of poor software. All you say is that it is an "add-in" (to what? MS Excel?). – Nick Cox Jul 01 '13 at 07:52
  • Correction: There are different formulas for correlation, but they are different versions of each other. – Nick Cox Jul 01 '13 at 08:28
  • +1 to @Nick. Whether you choose n or n-1 for your standard deviation, Pearson _r_ is the same afterwards. That is because _r_ is just a _cosine_ between centered variables, and the cosine formula doesn't include a df term, so your implicit use of df cancels out. – ttnphns Jul 01 '13 at 08:32
  • Also, the option `Kendall` is suspicious. Linear PCs are linear combinations of data values, original or transformed, which implies that the association measure _must_ be of [SSCP type](http://stats.stackexchange.com/a/22520/3277). Pearson _r_ is; Spearman _rho_ also is, being just _r_ on ranks. But Kendall _tau_ doesn't seem to be _r_ or a cosine of some data. Using Kendall in PCA is not justified. – ttnphns Jul 01 '13 at 08:41
  • `I now have to conduct a principal component analysis (PCA) of the different compounds measured in the different products (measured in different units)` Can you tell us more about it? Show a snippet of your data. Between what entities and why would you want to do PCA? – ttnphns Jul 01 '13 at 09:29
  • Thanks for the comments. I was using xlstat (an excel add in). I've added some more information in my post. – Oleic Jul 01 '13 at 21:30
  • 1
    For chemical concentrations, your reflex--and default approach--should be to analyze their logarithms. This is (in part) because the logarithms are what enter linearly in equations of chemical equilibria rather than the concentrations themselves. – whuber Jul 01 '13 at 21:56
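
The point made in the comments above, that the choice of n or n-1 cancels out of Pearson _r_ but not out of a covariance, can be checked numerically. The following is a minimal sketch in Python with NumPy; the data are simulated and the helper `pearson` is hypothetical, written only for this illustration. It says nothing about what XLSTAT actually computes under its "Pearson (n)" and "Pearson (n-1)" options.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (hypothetical) skewed data: 50 observations of two variables.
x = rng.lognormal(size=50)
y = 0.5 * x + rng.lognormal(size=50)

def pearson(x, y, ddof):
    """Pearson r computed with divisor (n - ddof) everywhere."""
    n = len(x)
    xc = x - x.mean()
    yc = y - y.mean()
    cov = (xc * yc).sum() / (n - ddof)
    return cov / (x.std(ddof=ddof) * y.std(ddof=ddof))

r_n = pearson(x, y, ddof=0)    # divisor n
r_n1 = pearson(x, y, ddof=1)   # divisor n - 1

# The covariance matrix, by contrast, does depend on the divisor.
cov_n = np.cov(x, y, ddof=0)
cov_n1 = np.cov(x, y, ddof=1)

print("r with n:    ", r_n)
print("r with n - 1:", r_n1)
print("covariance ratio (n-1 version / n version):", cov_n1[0, 1] / cov_n[0, 1])
```

The two values of _r_ agree to floating-point precision, while the two covariance matrices differ by the constant factor n/(n-1). A PCA of the correlation matrix is therefore unaffected by the choice, and a PCA of the covariance matrix has its eigenvalues rescaled by that factor but its eigenvectors unchanged.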

1 Answer

The principal vectors are the eigenvectors of the matrix you choose. When you choose Pearson you are choosing to find the eigenvectors of the Pearson correlation matrix. When you choose Spearman you are choosing to find the eigenvectors of the Spearman rank correlation matrix. The Spearman rank correlation is just the Pearson correlation between the ranked variables.

Because these matrices can differ substantially, it makes sense that the resulting biplots are very different. If you really believe that the correlation between your variables is linear, then I would stick with Pearson; otherwise I would go with Spearman or Kendall.
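
To make the difference concrete, here is a minimal sketch in Python with NumPy. The data are simulated, and the helpers `ranks` and `pca_from_corr` are hypothetical names introduced only for this illustration; this is not XLSTAT's implementation. The second variable is a monotone but non-linear function of the first, so the two correlation matrices, and hence their eigen-decompositions, differ.

```python
import numpy as np

def ranks(col):
    """Ranks 1..n of a 1-D array (assumes no ties, as with continuous data)."""
    order = np.argsort(col)
    r = np.empty(len(col))
    r[order] = np.arange(1, len(col) + 1)
    return r

def pca_from_corr(corr):
    """Eigenvalues and eigenvectors of a symmetric matrix, largest first."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    idx = np.argsort(eigvals)[::-1]
    return eigvals[idx], eigvecs[:, idx]

rng = np.random.default_rng(1)
n = 200

# Simulated (hypothetical) data: column 1 is a monotone, non-linear
# function of column 0; column 2 is unrelated to both.
a = rng.lognormal(size=n)
data = np.column_stack([a, a ** 3, rng.lognormal(size=n)])

# Pearson correlation matrix of the raw data.
pearson_corr = np.corrcoef(data, rowvar=False)

# Spearman correlation matrix = Pearson correlation of the column-wise ranks.
ranked = np.apply_along_axis(ranks, 0, data)
spearman_corr = np.corrcoef(ranked, rowvar=False)

vals_p, vecs_p = pca_from_corr(pearson_corr)
vals_s, vecs_s = pca_from_corr(spearman_corr)

print("Pearson  r(0,1):", pearson_corr[0, 1], "eigenvalues:", vals_p)
print("Spearman r(0,1):", spearman_corr[0, 1], "eigenvalues:", vals_s)
```

In this sketch the Spearman correlation between the first two columns is 1, because their ranks are identical, while the Pearson correlation is lower. The loadings in `vecs_s` describe linear combinations of the ranked columns, not of the original measurements.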

Jorge Banuelos
  • Thanks @Jorge. I am testing ~6 variables and the correlation is linear between some but not others (although I'm not sure if testing in pairs is the best way to test linearity, is there another way to do this?). The results make more "scientific" sense with Spearman however. – Oleic Jul 01 '13 at 04:54
  • 1
    The Spearman correlation will capture the linear correlations that the Pearson correlation captures between the non-rank-transformed variables, along with some nonlinear correlations. Also, correlation matrices only capture pairwise interactions. It seems to me that the loadings you obtain from PCA are your primary interest, so I would just stick to the Spearman correlation. It is not clear to me why you want to capture higher-order interactions between your variables. – Jorge Banuelos Jul 01 '13 at 05:03
  • Thanks. I'm looking at multiple chemical compounds (my variables) in different material 'phases'. The variables can be used in different combinations with each other, so we wanted to see if there were patterns between all the compounds. I'm not actually convinced PCA is the best way to do this, but it seems to be a popular approach currently in my field. – Oleic Jul 01 '13 at 05:21
  • `The Spearman rank correlation is just the Pearson correlation between the ranked variables.` A continuation suggests itself: "And hence the principal components from the Spearman matrix correspond to those ranked data and not to the original data". – ttnphns Jul 01 '13 at 09:05
  • Isn't Spearman-based also more robust for non-normally-distributed data? – gerrit Dec 16 '15 at 17:39