I'm doing principal components analysis (PCA) on quite a bit of data (3000 variables, 100079 data points). I'm doing this mostly for fun; data analysis is not my day job.
Normally, to do a PCA I would calculate the covariance matrix and then find its eigenvectors and corresponding eigenvalues. I understand very well how to interpret both of these, and find it a useful way to get to grips with a data set initially.
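To make that concrete, the covariance route I normally take looks roughly like this on a toy example (numpy only; the sizes and variable names here are just for illustration):

```python
import numpy as np

# Toy data: variables as rows, observations as columns
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))

# Covariance matrix of the variables (np.cov centers internally)
cov = np.cov(X)                          # 5 x 5

# Eigendecomposition: eigenvalues ascending, eigenvectors as columns
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components from largest to smallest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores: the (centered) data expressed in the principal-component basis
scores = eigvecs.T @ (X - X.mean(axis=1, keepdims=True))
```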
However, I've read that with such a large data set it's better (faster and more numerically accurate) to do the PCA via a singular value decomposition (SVD) of the data matrix instead.
I have done this using SciPy's `svd` function. I don't really understand SVD, so I might not have done it right (see below), but assuming I have, what I end up with is (1) a matrix `U` of size $3000\times 3000$; (2) a vector `s` of length $3000$; and (3) a matrix `V` of size $3000\times 100079$. (I used the `full_matrices=False` option; otherwise it would have been $100079\times 100079$, which is just silly.)
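For reference, the call itself is essentially this (shown with a small random array standing in for my real $3000 \times 100079$ matrix, and `scipy.linalg.svd` as the function I mean):

```python
import numpy as np
from scipy.linalg import svd

# Placeholder with the same layout as my data: variables as rows,
# observations as columns (the real array is 3000 x 100079)
X = np.random.rand(30, 1000)

U, s, V = svd(X, full_matrices=False)

print(U.shape)  # (30, 30)    -- 3000 x 3000 for my data
print(s.shape)  # (30,)       -- length 3000
print(V.shape)  # (30, 1000)  -- 3000 x 100079
```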
My questions are as follows:
1. It seems plausible that the singular values in the `s` vector might be the same as the eigenvalues of the correlation matrix. Is this correct?
2. If so, how do I find the eigenvectors of the correlation matrix? Are they the rows of `U`, or its columns, or something else?
3. It seems plausible that the columns of `V` might be the data transformed into the basis defined by the principal components. Is this correct? If not, how can I get that?
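To sanity-check my guesses before trusting them on the full data, I was planning to compare the two routes on a tiny toy example, along these lines (just printing both sets of numbers so I can see whether, and how, they correspond):

```python
import numpy as np
from scipy.linalg import svd

# Tiny toy data set: 4 variables, 50 observations
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 50))

# Eigendecomposition of the correlation matrix (the route I'm used to)
corr = np.corrcoef(X)                    # 4 x 4
eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order

# SVD of the data matrix (the route I'm trying instead)
U, s, V = svd(X, full_matrices=False)

print("eigenvalues of the correlation matrix:", eigvals[::-1])
print("singular values of the data matrix:   ", s)
```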
To do the analysis, I simply took my data in a big $3000 \times 100079$ numpy array and passed it to the `svd` function. (I'm aware that one should normally center the data first, but my intuition says I probably don't want to do this for my data, at least initially.) Is this the right way to do it? Or should I do something special to my data before passing it to this function?
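For what it's worth, if I do end up centering (or standardizing) first, I assume it would be done per variable, i.e. per row, along these lines:

```python
import numpy as np

X = np.random.rand(30, 1000)  # placeholder for my 3000 x 100079 array

# Centre each variable (row); optionally scale to unit variance as well
X_centered = X - X.mean(axis=1, keepdims=True)
X_standardized = X_centered / X_centered.std(axis=1, keepdims=True)
```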