I'm trying to run PCA on ~1000 patients and 100 genes. The goal is to visualize hidden groups among the patients. For each patient, I know whether each of the 100 genes is mutated or not (1/0).

The R documentation page recommends scale=TRUE. However, I feel that this kind of variable should not be scaled. Is my understanding correct?

  • That's up to you, keeping in mind what standardization means for binary variables. Hint: in a dichotomous variable the variance and the mean are tied; both are determined by the asymmetry of the distribution. – ttnphns Aug 16 '17 at 02:57
  • Thank you so much for the comment, and please excuse the naive question, but would you mind clarifying a bit more what you meant? I tried it on my particular data set: without scaling, some very distinct groups show up on a 3D PCA plot (16 groups, to be exact). Hierarchical clustering also shows similar groups. However, the two do not match well, and the clusters don't seem to be very good according to silhouette plots. With scaling, I can't really tell any distinct groups apart in the PCA. So I'm still not sure whether to scale or not. – cafelumiere Aug 16 '17 at 15:03
  • You may want to study [this answer](https://stats.stackexchange.com/a/16335/3277) showing different ways of doing linear PCA with binary data. In short: it is your decision to standardize or not to standardize. Think whether you want to remove the effect of skewness on the variance. – ttnphns Aug 18 '17 at 19:35
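
The hint in the comments about mean and variance being tied can be checked numerically. For a 0/1 variable with mutation frequency p, the variance is p(1 - p), so dividing by the standard deviation multiplies each column by 1/sqrt(p(1 - p)). The sketch below is only an illustration under assumptions: the three gene columns and their frequencies are made up, and it uses the population variance (denominator n), whereas R's scaling uses n - 1, a negligible difference at this sample size.

```python
import math


def binary_column_stats(column):
    """Return (mean, population variance) of a 0/1 column."""
    n = len(column)
    p = sum(column) / n
    var = sum((x - p) ** 2 for x in column) / n
    return p, var


# Hypothetical mutation indicators for 1000 patients (not real data).
genes = {
    "rare_gene": [1] * 10 + [0] * 990,       # mutated in 1% of patients
    "moderate_gene": [1] * 200 + [0] * 800,  # mutated in 20% of patients
    "balanced_gene": [1] * 500 + [0] * 500,  # mutated in 50% of patients
}

stats = {}
for name, column in genes.items():
    p, var = binary_column_stats(column)
    # Factor by which standardization multiplies the centered column.
    weight = 1 / math.sqrt(var)
    stats[name] = (p, var, weight)
    print(f"{name}: p={p:.2f}  variance={var:.4f}  "
          f"p(1-p)={p * (1 - p):.4f}  scaling factor={weight:.2f}")
```

In this toy setup the rare gene is multiplied by roughly 10 and the balanced gene by 2, so standardizing gives rarely mutated genes as much say in the components as common ones, while leaving the data unscaled lets genes with frequencies near 50% dominate. Which behaviour is preferable depends on whether rare mutations should count as much as common ones.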
