I ran a PCA on dichotomous variables (0's and 1's). The dataset consists of human subjects and a few thousand genetic variants, where a 1 indicates that a subject carries a given variant and a 0 that they do not.
My first PC correlates > .9 with the number of 1's per subject.
Is this expected?
Could this be a consequence of the fact that PCA is not really meant for binary data?
Or does this simply mean that subjects with more 1's resemble other subjects with more 1's (i.e., the more genetic variants a subject carries, the more likely it is that those are the same variants carried by another individual with an approximately equal number of genetic variants)?
Or could there be an alternative explanation?
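One way to probe whether the binary coding itself or differences in variant count between subjects drive this pattern is a small simulation. The sketch below is a toy model, not my actual data: it assumes variants are independent of each other, that the PCA is done on column-centered (unscaled) data via SVD, and the names `pc1_vs_count`, `burden`, `X_burden`, and `X_flat` are made up for illustration.

```python
import numpy as np

def pc1_vs_count(X):
    """Correlation between PC1 scores and the number of 1's per subject."""
    Xc = X - X.mean(axis=0)                      # center each variant (column)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    pc1 = U[:, 0] * s[0]                         # PC1 scores, one per subject
    counts = X.sum(axis=1)                       # number of 1's per subject
    return np.corrcoef(pc1, counts)[0, 1]

rng = np.random.default_rng(0)
n_subjects, n_variants = 200, 2000               # assumed sizes, for illustration only

# Case 1 (assumption): each subject has their own probability of carrying any variant.
burden = rng.uniform(0.05, 0.5, size=n_subjects)
X_burden = (rng.random((n_subjects, n_variants)) < burden[:, None]).astype(float)

# Case 2 (control): same binary coding, but every subject has the same probability.
X_flat = (rng.random((n_subjects, n_variants)) < 0.25).astype(float)

r_burden = pc1_vs_count(X_burden)
r_flat = pc1_vs_count(X_flat)

# The sign of a principal component is arbitrary, so compare absolute values.
print(f"|r| with subject-specific burden: {abs(r_burden):.3f}")
print(f"|r| with identical burden:        {abs(r_flat):.3f}")
```

If this toy model is a fair approximation, a strong correlation in the first case together with a weak one in the second would suggest that the pattern comes from subjects differing in their overall number of variants, rather than from running PCA on binary data as such. Real genotype data have correlated variants and population structure, which this sketch ignores.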
I hope the problem is well specified, otherwise please let me know! Many thanks!