
I have noticed that when applying PCA to large datasets, people often first subset the data considerably. Sometimes they simply take a random subset of the features/variables, but usually they have a reason, mostly related to removing variables they consider likely to be noise. A prototypical example is the analysis of Drop-Seq single-cell sequencing data from retina cells, where the authors subset their expression matrix from 25,000 genes to the 384 most highly variable genes and then proceed to use various unsupervised dimensionality-reduction techniques such as PCA and t-SNE.

I have seen this sort of pre-processing in several other places as well. However, I don't understand why this subsetting (feature pre-selection) is necessary. PCA reduces the dimensionality in a way that maximizes the retained variance, so the genes that are not varying should be largely ignored anyway. Why subset the data so dramatically when the non-varying genes should have little effect on the result of PCA?

This is not a question specific to this paper; it seems to be something of a standard approach to large datasets, so I assume there is something I am missing.

  • More specifically, it is to identify the most variable genes after controlling for the relationship between mean expression and variability. They calculated the dispersion (variance/mean) and the standard deviation for each gene, binned them by their average expression, and then z-normalized the dispersion measure in each bin, and used a z-score cutoff of 1.7. This is detailed in the supplementary material of the paper. – user310374 May 10 '16 at 14:34
  • 2
    [This one is probably a related question](http://stats.stackexchange.com/questions/50537/should-one-remove-highly-correlated-variables-before-doing-pca) – PolBM May 10 '16 at 14:38

1 Answer


The paper itself is openly available online, but its supplementary materials are not, so I copy the relevant parts here. Here is how the authors ran PCA:

We ran Principal Components Analysis (PCA) on our training set as previously described (Shalek et al., 2013), using the prcomp function in R, after scaling and centering the data along each gene. We used only the previously identified “highly variable” genes as input to the PCA in order to ensure robust identification of the primary structures in the data.
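
For concreteness, here is a minimal sketch of such a call, not the authors' actual code; `hvg_expr` stands in for a cells × genes matrix already restricted to the selected highly variable genes, and is simulated here just so the snippet runs on its own:

```r
# Toy stand-in for the cells x genes matrix of highly variable genes
set.seed(1)
hvg_expr <- matrix(rnorm(100 * 384), nrow = 100, ncol = 384)

# scale. = TRUE centers each gene and scales it to unit variance,
# so the PCA is effectively performed on the correlation matrix
pca <- prcomp(hvg_expr, center = TRUE, scale. = TRUE)

head(pca$x[, 1:5])   # scores of the cells on the first five PCs
```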

And here is how they selected the "highly variable" genes:

We first identified the set of genes that was most variable across our training set, after controlling for the relationship between mean expression and variability. We calculated the mean and a dispersion measure (variance/mean) for each gene across all 13,155 single cells, and placed genes into 20 bins based on their average expression. Within each bin, we then z-normalized the dispersion measure of all genes within the bin, in order to identify outlier genes whose expression values were highly variable even when compared to genes with similar average expression. We used a z-score cutoff of 1.7 to identify 384 highly variable genes.
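
A rough R sketch of this selection, reflecting my reading of the quoted description rather than the authors' code (the toy matrix, the gene names, and the equal-width binning are all assumptions made just to have something runnable):

```r
# Toy expression matrix: 200 "cells" x 1000 "genes" with made-up means
set.seed(1)
n_cells <- 200; n_genes <- 1000
lambda  <- runif(n_genes, 0.1, 10)            # per-gene mean expression
expr    <- matrix(rpois(n_cells * n_genes, rep(lambda, each = n_cells)),
                  nrow = n_cells)
colnames(expr) <- paste0("gene", seq_len(n_genes))

# Mean and dispersion (variance / mean) of each gene across cells
gene_mean  <- colMeans(expr)
dispersion <- apply(expr, 2, var) / gene_mean

# 20 bins by average expression (equal-width bins here; the paper does
# not say how exactly the bins were constructed)
bins <- cut(gene_mean, breaks = 20)

# z-normalize the dispersion within each bin, then apply the 1.7 cutoff
disp_z   <- ave(dispersion, bins, FUN = function(d) (d - mean(d)) / sd(d))
hvg      <- colnames(expr)[!is.na(disp_z) & disp_z > 1.7]
hvg_expr <- expr[, hvg]
```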


Notice two things:

  1. They select "highly variable" genes based on their variances (relative to the mean, but this is not important here). The genes with unusually large variances will get selected.

  2. They perform PCA after scaling, i.e. $z$-scoring, the data for each gene. In other words, PCA is done on the correlation matrix, not on covariance matrix. The scaled genes that go into PCA all have the same variance equal to $1$.

This explains why the pre-selection is not superfluous here: the PCA is done on correlations, i.e. without using the variance information at all; the variances are only used for pre-selection.
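
A quick sanity check of this point (a toy sketch, not from the paper): after per-gene scaling, the PC variances returned by `prcomp` are exactly the eigenvalues of the correlation matrix, so the original gene variances cannot influence the result, no matter how different they are.

```r
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100)
X <- sweep(X, 2, 1:10, `*`)     # give the columns wildly different variances

# PC variances from the scaled PCA equal the eigenvalues of cor(X)
all.equal(prcomp(X, scale. = TRUE)$sdev^2, eigen(cor(X))$values)   # TRUE
```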

One can certainly imagine a situation where PCA on correlations between all genes and PCA on correlations between only the "highly variable" genes yield very different results. For example, in principle it can happen that the least variable genes are more strongly correlated with each other than the highly variable genes are, in which case they would dominate the PCA.
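
Here is a hedged toy simulation of exactly that scenario (made-up numbers with no relation to real expression data): five low-variance genes driven by a common factor, and five high-variance genes that are pure independent noise.

```r
set.seed(42)
n <- 500
common   <- rnorm(n)
low_var  <- sapply(1:5, function(i) 0.1 * common + 0.02 * rnorm(n))  # tiny variance, strongly correlated
high_var <- sapply(1:5, function(i) 10 * rnorm(n))                   # huge variance, uncorrelated
X <- cbind(low_var, high_var)
colnames(X) <- c(paste0("low", 1:5), paste0("high", 1:5))

# Covariance-based PCA: PC1 loads on the high-variance genes
round(prcomp(X, scale. = FALSE)$rotation[, 1], 2)

# Correlation-based PCA: PC1 loads on the correlated low-variance genes
round(prcomp(X, scale. = TRUE)$rotation[, 1], 2)
```

On the unscaled data the first loading vector concentrates on the high-variance columns, whereas on the scaled data it spreads over the five correlated low-variance columns.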

I have no experience with such data, so I cannot comment on how useful this procedure is in this particular application domain.

  • I think I am missing a crucial understanding about the difference between PCA on scaled data vs. on unscaled data. You say that it is the difference between PCA on a correlation matrix vs. PCA on a covariance matrix. How would that give different results? Is it standard to scale data before doing PCA on it? – user310374 May 11 '16 at 13:48
  • @user310374: It is one standard way to run PCA, yes. Please see here: [PCA on correlation or covariance?](http://stats.stackexchange.com/questions/53) Feel free to ask for clarifications after you read through the answers there. – amoeba May 11 '16 at 13:53
  • That is excellent, thank you @amoeba! So in this case, if the data were not scaled, then using all the genes would potentially make more sense, because then the highly varying genes would be able to dominate the PCA. But after scaling, the genes that are just low-variance noise will dominate the top PCs, so it makes more sense to subset them out. – user310374 May 11 '16 at 14:03
  • @user310374 Basically yes, but it's not necessarily true that after scaling the low-variance genes would dominate the top PCs. If the low-variance genes are also poorly correlated with each other, then they wouldn't matter. But they *can* dominate the top PCs if they are correlated higher than (or at least comparably to) the high-variance genes. It is a matter of judgement to decide what is important here: high variance or high correlation. This particular analysis makes some sort of a compromise. – amoeba May 11 '16 at 14:11
  • 1
    Thanks for posting the supplementaries. I struggle to see why this procedure is a good idea or why it makes the results more 'robust'. – conjectures May 11 '16 at 17:12
  • @amoeba, after reading the supplementary more closely, it seems that their PCA is actually done with the cells being features. They are not trying to project the cells onto the principal components like I thought. For example, it says that "the number of principal components is equal to the number of profiled cells," which would only be true if this were the case. They are interested in the loadings of the individual cells along the principal components, not the projection of the cells onto the principal components. So, I guess here, subsetting the genes is more down-sampling. – user310374 May 11 '16 at 23:39
  • 1
    @user310374 I would have to look at it again, but I suspect you are mistaken. The number of PCs is limited to the number of cells because there is less cells then genes. In the $n

    – amoeba May 11 '16 at 23:41