PCA on binary data (0's & 1's) -> what does it mean when a PC is correlated with the nr of 1's per subject?

Question

I conducted a PCA on dichotomous variables (0's and 1's). The dataset consists of human subjects and a few thousand genetic variants, where 1 indicates that a genetic variant is present in a subject and 0 that it is absent.

My first PC correlates >.9 with the nr of 1's in a subject.

Is this expected?

Could this be a consequence of the fact that PCA is actually not meant for binary data?

Or does this simply mean that subjects with more 1's resemble other subjects with more 1's (i.e., the more genetic variants are present, the more likely it is that those are the same variants as in other individuals with an approximately equal number of genetic variants)?

Or could there be an alternative explanation?

I hope the problem is well specified, otherwise please let me know! Many thanks!
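To make the second hypothesis concrete, here is a minimal simulation sketch in Python with NumPy. It is not the analysis from the question: the sizes, the latent per-subject `propensity`, and the layout (subjects as rows, variants as columns) are all assumptions chosen for illustration. If some subjects simply tend to carry more variants overall, the variant columns become positively correlated with each other, and the first PC of the column-centered data ends up close to a plain count of 1's.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_variants = 200, 500  # illustrative sizes, not the real data

# Assumption: each subject has a latent propensity to carry any variant.
propensity = rng.uniform(0.1, 0.6, size=n_subjects)
X = (rng.random((n_subjects, n_variants)) < propensity[:, None]).astype(float)

# PCA on the covariance matrix: center each variant (column), then take the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = U[:, 0] * s[0]   # subject scores on the first PC
pc1_loadings = Vt[0]          # variant loadings on the first PC

ones_per_subject = X.sum(axis=1)
corr = np.corrcoef(pc1_scores, ones_per_subject)[0, 1]

# The sign of a PC is arbitrary, so look at the larger of the two sign groups.
same_sign_share = max((pc1_loadings > 0).mean(), (pc1_loadings < 0).mean())

print(f"|corr(PC1, number of 1's)| = {abs(corr):.3f}")
print(f"share of loadings with the same sign = {same_sign_share:.3f}")
```

Under these assumptions nearly all loadings on PC1 share one sign, so PC1 behaves like a weighted sum of the columns. That would be consistent with the "subjects with more 1's resemble each other" reading, but it does not rule out other explanations for real data.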

How do you pre-process your data? The number of 1's? Do you standardize it? I suspect a "mean" factor appears somewhere. Since you work with 0's and 1's, the mean is directly linked to the number of 1's. — lcrmorin, May 15 '13 at 16:56
No, I don't standardize it. I just run a PCA on the 0's and 1's. I did it twice, once on a covariance matrix of the subjects with princomp in R, and once just putting the matrix of 0's and 1's in the following perl module: http://search.cpan.org/~dsth/Statistics-PCA-0.0.1/lib/Statistics/PCA.pm . The outcome is pretty much the same... The number of 1's per subject is also not standardized (just a column with the sums of all 1's for each subject/row). — Abdel, May 15 '13 at 17:01
Short answer is no. Think of it this way. PCA works on either the correlation or the covariance matrix. It never sees the means; they were subtracted out in calculating that matrix. So, if a variable with high mean (or low mean) is highly correlated with PC1, all that probably implies is that it reflects a bundle of relatively highly correlated variables. PCA pays no attention to means as such. — Nick Cox, May 15 '13 at 17:04
@Nick, Generally, you aren't right in saying `either the correlation or the covariance`. Linear PCA works (and is being done) on any [scalar-product](http://stats.stackexchange.com/a/22520/3277) similarity. It may be covariances or correlations or cosines or raw SSCPs. In the latter two cases no mean subtraction occurs, which [affects](http://stats.stackexchange.com/a/22331/3277) the PCs greatly. — ttnphns, May 15 '13 at 20:09
@ttnphns I take your point but see this as a matter of broad versus strict definitions. You could define any eigenvector-eigenvalue calculation as PCA if so minded. I regard PCA based on correlation or covariance matrices as the central definition in statistical science, and from the discussion before I commented those were the possibilities being considered. But you are naturally quite correct to emphasise wider definitions (which in my experience are more common outside mainstream statistics, e.g. in meteorology or oceanography). — Nick Cox, May 15 '13 at 20:22
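The distinction drawn in the comments between PCA on mean-centered data (covariance) and PCA on raw cross-products (SSCP) can be checked with a small simulation. The following Python/NumPy sketch uses made-up data: independent variants, each present with probability 0.3, with subjects as rows. These are assumptions for illustration only, and the sketch says nothing about what `princomp` or the Perl module do internally. The helper `pc1_scores` is likewise a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects, n_variants = 200, 500  # illustrative sizes

# Assumption: variants are independent, each present with probability 0.3.
X = (rng.random((n_subjects, n_variants)) < 0.3).astype(float)
ones_per_subject = X.sum(axis=1)

def pc1_scores(M):
    """Scores on the first principal axis of M, computed via the SVD."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, 0] * s[0]

# Covariance-style PCA: variant (column) means are subtracted first.
scores_centered = pc1_scores(X - X.mean(axis=0))
# Raw SSCP-style PCA: no mean subtraction at all.
scores_raw = pc1_scores(X)

# PC signs are arbitrary, so compare absolute correlations.
corr_centered = abs(np.corrcoef(scores_centered, ones_per_subject)[0, 1])
corr_raw = abs(np.corrcoef(scores_raw, ones_per_subject)[0, 1])

print(f"centered:   |corr(PC1, number of 1's)| = {corr_centered:.3f}")
print(f"uncentered: |corr(PC1, number of 1's)| = {corr_raw:.3f}")
```

In this setup the uncentered first axis points roughly along the vector of variant means, so its scores track the number of 1's closely even though the variants are independent. The centered PC1 has no such built-in link. A high correlation after centering would therefore point to genuinely correlated variants rather than to a mean effect.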
