I'm trying to QA a process in which the data has over a million rows and approximately 60,000 binary variables. The aim of the process was to perform k-means clustering, but prior to this, the 60,000 variables were put through PCA to reduce dimensionality. My issue is that the data was split into batches of 5,000 variables, so 12 separate PCAs were conducted on 5,000 variables each; 500 PCs were then kept from each of the 12 batches and merged together.

My knowledge of PCA doesn't extend much beyond the basics taught at university, but I have a bad feeling that we may be missing too many of the correlations between variables that end up in separate PCAs.

Am I right to be concerned? Is there a better approach, or is there a way to quantify what we may be losing by doing this?

1 Answer

Note: though the old answer (below the line) was accepted, the comment below alerted me to the fact that I had misinterpreted the question. My old answer pertains to comparing PCAs run on different batches of observations (i.e. different rows). But the question is actually about running PCAs on different batches of variables (i.e. different columns). I will now address this.

In order to reduce dimensionality, a PCA calculates orthogonal vectors from the entire set of variables. If you do not do the PCA on all variables, you are by definition not achieving this basic goal. By doing PCAs on 5000 variables at a time and retaining 500 PCs from each of the 12 batches, you are at risk of capturing plenty of redundant information in your final set of 6000 PCs. If there are a few dominant axes of variation, these would be captured over and over in each of the 12 batches. You could check the extent to which this is true by doing another PCA on your aggregated 6000 PCs.
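If it helps, here is a rough sketch of that check in Python with scikit-learn; the file name and array name are placeholders for however the merged PCs are actually stored:

```python
import numpy as np
from sklearn.decomposition import PCA

# merged_pcs: hypothetical (n_rows, 6000) array holding the 500 PCs
# retained from each of the 12 batch-wise PCAs, concatenated column-wise.
merged_pcs = np.load("merged_pcs.npy")

# A second PCA on the merged PCs exposes redundancy: if far fewer than
# 6000 components explain most of the variance, the 12 batch-wise PCAs
# were capturing the same dominant axes over and over.
check = PCA(n_components=500).fit(merged_pcs)
cum_var = np.cumsum(check.explained_variance_ratio_)
print("components needed for 90% of variance:",
      int(np.searchsorted(cum_var, 0.90)) + 1)
```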

As for better solutions, I'm not an expert, but here are a couple of thoughts. (i) There are incremental PCA methods specifically designed for this, and I think they work by loading a few rows into memory at a time. (ii) As that implies, I think you need to use all variables (columns) to do the PCA, but you do not need to use all observations (rows). So a simple option is to do the PCA on a subset of the observations and then apply it to the rest of the dataset.
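For (i), something like scikit-learn's IncrementalPCA would do this. The sketch below is illustrative rather than tuned; the file name, chunk count, and component count are all assumptions:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Memory-map the full data so slicing does not load everything at once.
X = np.load("binary_data.npy", mmap_mode="r")  # shape ~(1_000_000, 60_000)

ipca = IncrementalPCA(n_components=500)

# Feed the rows through in chunks; each chunk still has all 60,000
# columns, and each chunk must contain at least n_components rows.
for chunk in np.array_split(X, 500):
    ipca.partial_fit(chunk)

# The fitted rotation can then be applied chunk by chunk as well.
reduced = np.vstack([ipca.transform(chunk) for chunk in np.array_split(X, 500)])
```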

---
You're correct that this is a problem: based on how this has been done, the PCs cannot be compared with each other across batches [of observations].

This is mainly because even small differences in the covariance structure between batches will lead to different orthogonal vectors being identified. In other words, PC1 on batch 1 and PC1 on batch 2 represent different things! If you examine the loadings of some of the PCs across batches, you will see these differences. But even if the covariance structure were identical for some magical reason, a PC might have reversed coefficient signs in a different batch, because the signs of PC loadings are arbitrary.
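Here is a toy illustration of both effects (synthetic data; all shapes and seeds are arbitrary): fit PCA separately on two halves of the rows and compare the PC1 loadings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with correlated columns; both row batches come from the
# same population, so any divergence is purely the effect of batching.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20)) @ rng.normal(size=(20, 20))

pc1_a = PCA(n_components=1).fit(X[:1000]).components_[0]
pc1_b = PCA(n_components=1).fit(X[1000:]).components_[0]

# Cosine similarity near -1 means the same axis with a flipped sign;
# values far from +/-1 mean the batches found genuinely different axes.
cos = pc1_a @ pc1_b / (np.linalg.norm(pc1_a) * np.linalg.norm(pc1_b))
print("cosine similarity between batch PC1 loadings:", round(float(cos), 3))
```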

The simplest thing to do would be to run a PCA on all the data simultaneously. If that is too much of a computational challenge, you can do it on a random subset of the data and then apply that PCA to the remaining data. This has been discussed in a number of questions on this site, e.g. How is PCA applied to new data?
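As a rough sketch of that subset approach (subset size, seed, solver, and file name are all assumptions, and the subset must be small enough to fit in memory):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.load("binary_data.npy", mmap_mode="r")  # placeholder for the full data

# Fit the rotation on a random subset of rows, sized to fit in memory;
# sorting the indices just makes the memory-mapped read more efficient.
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(X.shape[0], size=20_000, replace=False))
pca = PCA(n_components=500, svd_solver="randomized").fit(X[idx])

# ...then apply that same rotation to all rows, chunk by chunk.
X_reduced = np.vstack([pca.transform(chunk) for chunk in np.array_split(X, 500)])
```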

As an aside, I note that you are applying PCA to binary data. Though this can be done, there is valuable discussion in the threads below about what that implies and some possibly better alternatives:

Doing principal component analysis or factor analysis on binary data

Can principal component analysis be applied to datasets containing a mix of continuous and categorical variables?

  • A little late, but I would appreciate it if you could clarify this. In the OP, the data was batched such that each batch had the same rows, but different columns. Because of that, it is fairly obvious that PC1 on batch 1 and PC1 on batch 2 represent different things. I don't necessarily see why that is a problem, since the idea is presumably to get _different_ sets of PCs for different subsets of variables and then concatenate them together. The problem of missing correlations would still remain, but would this still be invalid? – tborenst Feb 06 '20 at 15:06
  • @tborenst I have not been on here in a while, so apologies for the late response. You are completely correct, I misinterpreted the question! I have now attempted to answer the actual question. – mkt Mar 30 '20 at 09:45
  • Thank you very much! That is very helpful. Appreciate you coming back and updating the answer. – tborenst Mar 31 '20 at 16:18