
I just wanted to do this small experiment to make sure I understand PCA correctly. My dataset contains 8 columns. The first two columns are randomly generated in Excel with `RANDBETWEEN(4, 5)`, and the other 6 columns are generated the same way using `RANDBETWEEN(1, 3)`.

When I do PCA on this, I am not getting good results. I expected the result to show a high eigenvalue for a factor that is a combination of the first two columns, and low eigenvalues for the other columns. This is my code in R:

sensex.dat = read.csv('C:/Study/_SEM4/brand man/emperical/dice.csv', header = T)
sensex.cov = cov(sensex.dat)                      # covariance (not correlation) matrix
sensex.eigen = eigen(sensex.cov, symmetric = T)   # eigendecomposition
sensex.eigen$values
sensex.eigen$vectors
– Prakhar
  • Can you explain _why_ you expected that result? That is probably the most important information if you want to know whether you've understood PCA correctly. :) – MånsT Aug 31 '12 at 05:39
  • The first 2 factors behave similarly and differently from the others, so shouldn't PCA combine the first 2 factors as one factor? – Prakhar Aug 31 '12 at 05:57
  • 3
    No, because 'behave similarly' means 'be Pearson correlated' for PCA. Amplitude does not really matter. –  Aug 31 '12 at 06:11
  • 1
    For a working `R` example of how to construct a random dataset with specific PCA output and how to compare the actual output to the intended output, please see the answer at http://stats.stackexchange.com/a/35035. – whuber Aug 31 '12 at 13:38

2 Answers


As others have told you, PCA does not look for amplitude; in fact, it is standard procedure to normalize your variables before a PCA, which you did not do. PCA looks for correlations between the columns.
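
To see this, here is a minimal sketch (my own simulation in R; `sample()` stands in for Excel's RANDBETWEEN): with mutually independent columns, the eigenvalues of the correlation matrix all come out close to 1, so no single factor dominates.

set.seed(42)
n = 200
dice = cbind(matrix(sample(4:5, n * 2, replace = TRUE), n),  # like RANDBETWEEN(4,5)
             matrix(sample(1:3, n * 6, replace = TRUE), n))  # like RANDBETWEEN(1,3)
eigen(cor(dice), symmetric = TRUE)$values                    # all roughly 1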

You would get the result you want by:

  1. Randomly generating a column.
  2. Generating a second random column with similar parameters, but adding the first column to it. In your example this would basically be the first column + RANDBETWEEN.
  3. Generating additional uncorrelated columns as in step 1.
  4. Normalizing, and then getting the eigenvalues and vectors (see the sketch below).
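
A minimal sketch of these four steps, assuming plain `runif()` noise in place of Excel's RANDBETWEEN (the variable names here are my own):

set.seed(1)
n = 200
x1 = runif(n, 4, 5)                        # step 1: a random column
x2 = x1 + runif(n, 4, 5)                   # step 2: the first column plus noise
rest = matrix(runif(n * 6, 1, 3), n)       # step 3: six more uncorrelated columns
dat = scale(cbind(x1, x2, rest))           # step 4: normalize
eigen(cov(dat), symmetric = TRUE)$values   # the first eigenvalue now dominates

Under this construction x1 and x2 are correlated at about 0.7, which is exactly the structure PCA picks up: the first eigenvector loads mainly on those two columns.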
– Erik
  • My understanding is that you wouldn't _always_ standardize before doing PCA (you always subtract the mean, however). If, for example, you were trying to model salaries of different occupations and you cared primarily about the \$ accuracy of approximating with a reduced set of components, you might not standardize, choosing to accept higher relative errors on low salaries. OTOH, if the variables are on different scales you'd always standardize. Discussed e.g. [here](http://stats.stackexchange.com/questions/62677/covariance-v-correlation-based-pca-theoretical-view/62699#62699) in more detail. – TooTone Mar 18 '14 at 21:27
  • Standardization when performing PCA on the correlation matrix (the usual approach, outside of a few fields, like morphometrics, where they use the covariance matrix) is irrelevant: $\operatorname{cor}(\mathbf{X}) = \operatorname{cor}(a\mathbf{X} + b)$ for $0 < a < \infty$ and $-\infty < b < \infty$. Covariance matrix applications of PCA entail more than simply standardizing variables, since the assumption about what each component contributes to total variance is substantively different. – Alexis Apr 24 '14 at 14:36

The post referred to by whuber is quite useful for artificial data. Here's a simple PCA of your random data, generated in R. The first two principal components explain about 40% of the variation, and the remaining six explain the rest.

n = 100
sensex.dat = matrix(NA, nrow = n, ncol = 8)
sensex.dat[, 1:2] = runif(n * 2, 4, 5)   # first two columns: uniform on (4, 5)
sensex.dat[, 3:8] = runif(n * 6, 1, 3)   # other six columns: uniform on (1, 3)

p = princomp(sensex.dat, cor = FALSE)    # PCA on the covariance matrix (unscaled)
summary(p)
biplot(p, xlabs = rep('+', n))           # plot observations as '+' marks
screeplot(p)

[Figures: biplot of the first two principal components, and scree plot of the components]

– mrbcuda