I'm doing principal components analysis (PCA) on quite a bit of data (3000 variables, 100079 data points). I'm doing this mostly for fun; data analysis is not my day job.
Normally, to do a PCA I would calculate the covariance matrix and then find its eigenvectors and corresponding eigenvalues. I understand very well how to interpret both of these, and find it a useful way to get to grips with a data set initially.
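For concreteness, this is roughly what I mean by that approach, sketched on a toy stand-in array (the sizes and variable names are just illustrative, not my real data):

```python
import numpy as np

# Toy stand-in for my real array: rows are variables, columns are data points,
# the same orientation as my 3000 x 100079 array.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))

# Covariance matrix of the variables (np.cov treats rows as variables by default).
C = np.cov(X)

# Symmetric eigendecomposition; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

# Sort into descending order so the first component explains the most variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```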
However, I've read that with such a large data set it's better (faster and more accurate) to perform the principal components analysis by taking the singular value decomposition (SVD) of the data matrix instead.
I have done this using SciPy's svd function. I don't really understand SVD, so I might not have done it right (see below), but assuming I have, what I end up with is (1) a matrix U of size $3000\times 3000$, (2) a vector s of length $3000$, and (3) a matrix V of size $3000\times 100079$. (I used the full_matrices=False option, otherwise V would have been $100079\times 100079$, which is just silly.)
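In case it matters, this is essentially the call I made, shown here on a small stand-in array with the same row/column orientation as my real one:

```python
import numpy as np
from scipy.linalg import svd

# Stand-in for my real 3000 x 100079 array: one row per variable,
# one column per data point.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 500))

# full_matrices=False gives the "economy" SVD, so nothing of size
# 100079 x 100079 is ever formed.
U, s, V = svd(data, full_matrices=False)

print(U.shape)  # (30, 30)   -- 3000 x 3000 on the real data
print(s.shape)  # (30,)      -- length 3000 on the real data
print(V.shape)  # (30, 500)  -- 3000 x 100079 on the real data
```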
My questions are as follows:
1. It seems plausible that the singular values in the s vector might be the same as the eigenvalues of the correlation matrix. Is this correct? (I sketch the kind of check I have in mind after this list.)
2. If so, how do I find the eigenvectors of the correlation matrix? Are they the rows of U, or its columns, or something else?
3. It seems plausible that the columns of V might be the data transformed into the basis defined by the principal components. Is this correct? If not, how can I get that?
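To make question 1 concrete, this is the kind of check I'm imagining running on a small subset first (purely illustrative; I haven't convinced myself what scaling, if any, should relate the two sets of numbers):

```python
import numpy as np
from scipy.linalg import svd

# Small illustrative subset: 20 variables, 300 data points.
rng = np.random.default_rng(0)
subset = rng.normal(size=(20, 300))

# Singular values of the (uncentered) data matrix.
U, s, V = svd(subset, full_matrices=False)

# Eigenvalues of the correlation matrix of the same subset, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.corrcoef(subset)))[::-1]

# Question 1 in numbers: do these line up, and if so with what scaling?
print(s[:5])
print(eigvals[:5])
```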
To do the analysis, I simply took my data in a big $3000 \times 100079$ numpy array and passed it to the svd function. (I'm aware that one should normally center the data first, but my intuition says I probably don't want to do this for my data, at least initially.) Is this the right way to do it? Or should I do something special to my data before passing it to this function?
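For completeness, this is how I understand centering would work if I do decide to do it: subtract each variable's mean, i.e. the mean over each row, before the SVD (again sketched on a stand-in array):

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 500))  # stand-in for my 3000 x 100079 array

# Subtract each variable's (row's) mean; the averaging is over the columns.
centered = data - data.mean(axis=1, keepdims=True)

U, s, V = svd(centered, full_matrices=False)
```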