What is the effect on PCA of having too many zeros in the data?

Question

I want to use principal component analysis (PCA) to derive dietary patterns. However, my data contain many zeros (no intake) for many observations. I have been unable to find literature that tells me how biased my results would be as a result. Some papers in the environmental sciences describe this problem, but my data are food intakes: a zero means no intake, so it cannot be replaced by another number the way it is in other fields.

Because PCA is almost always used as an exploratory technique--especially in environmental sciences--your concerns may be misplaced. Could you tell us more about what you're trying to use it for? — whuber, Jan 15 '16 at 15:31
Thanks @whuber for the comments. I am using PCA to derive patterns of food intake and to see how these patterns are associated with health outcomes. There are a lot of non-consumers (for many food items, over half of the sample are non-consumers), which means my data have more zeros than intake values. Now I am wondering how this will affect the PCA. — student, Jan 15 '16 at 15:46
If you do the PCA and find associations, isn't that enough? Even if PCA were (hypothetically) in some sense highly "biased," if it is carried out on a set of regressors and its results are then compared to a separate set of response variables and found to be correlated, then you will have established an association. — whuber, Jan 15 '16 at 15:54
My concern at the moment is not the association but how my derived patterns would differ if my data did not have a high proportion of zeros. My question: if the data have a great many zeros, will that affect the derived PCA patterns, and if so, what can I do? — student, Jan 15 '16 at 16:19
That seems like a useless hypothetical: your data *do* have lots of zeros. They will exhibit whatever patterns they do, and those will necessarily differ from the patterns of somebody else's dataset that happens not to have lots of zeros. If these zeros did not affect the PCA, then PCA would be useless to you. Why do you think you have to do anything about it? Have you tried the PCA and obtained results that preclude further analysis in some way? Please visit http://stats.stackexchange.com/questions/16331 for some discussion of an extreme example of your situation by an expert. — whuber, Jan 15 '16 at 18:16

score 0 · Answer 1 · answered May 18 '21 at 12:46

In case others find this old post: I do not think this concern is unfounded. Several studies, in both environmental (species counts, for example) and non-environmental fields, have highlighted issues that zero-inflated data cause in standard PCA. Here is a memo on the assumptions of PCA that I find useful: http://alexhwilliams.info/itsneuronalblog/2016/03/27/pca/

Here are some other resources with solutions that do not involve substituting other numbers for the zeros:

Hellton et al. 2021. The Truth behind the Zeros: A New Approach to Principal Component Analysis of the Neuropsychiatric Inventory. Multivariate Behav Res. 56(1):70-85. https://pubmed.ncbi.nlm.nih.gov/32329370/

Pierson & Yau 2015. ZIFA: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome Biol 16, 241. https://doi.org/10.1186/s13059-015-0805-z

Modelling Data with Many Zeros - Principal Component Analysis vs Zero Inflated Models
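
As a rough way to see how true zeros can change a standard PCA, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data are synthetic, the two latent "patterns", the 60% non-consumption rate, and the helper name `pca_explained_variance` are hypothetical, and non-consumption is drawn independently of the patterns, which may well not hold for real dietary data. Under these assumptions the zeros weaken the correlations between items, so the leading component captures a smaller share of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 500 people, 6 food items, amounts driven by 2 latent patterns.
n, p = 500, 6
latent = rng.normal(size=(n, 2))
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
# Positive, right-skewed "amount if consumed" for every person and item.
amounts = np.exp(0.5 * latent @ loadings + 0.3 * rng.normal(size=(n, p)))

# Assumption: each intake is independently a true zero with probability 0.6.
consumed = rng.random((n, p)) > 0.6
intake = amounts * consumed


def pca_explained_variance(x):
    """Explained-variance ratios of a PCA on standardized columns."""
    z = (x - x.mean(axis=0)) / x.std(axis=0)
    singular_values = np.linalg.svd(z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


ratio_full = pca_explained_variance(amounts)   # data without zeros
ratio_zeros = pca_explained_variance(intake)   # same data with true zeros
zero_fraction = float((intake == 0).mean())

print("fraction of zeros:", round(zero_fraction, 2))
print("explained variance without zeros:", np.round(ratio_full, 2))
print("explained variance with zeros:   ", np.round(ratio_zeros, 2))
```

Rerunning this kind of check with a zero pattern that resembles your own data (for example, non-consumption that depends on the latent patterns) gives a feel for how sensitive the derived components are.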

It should be noted that the zeros described in the OP differ from the kind at issue in some of the linked discussions. E.g., in scRNA-seq there are zeros due to measurement error (low levels of expression can't be detected), whereas the OP describes true zeros. — bdeonovic, May 18 '21 at 13:55
@bdeonovic That's entirely true for gene expression, and can indeed influence the choice of method (although I would not call it measurement error but rather limit of quantitation) — Charlotte R, May 26 '21 at 13:48
I would call limit of quantification a type of measurement error :) — bdeonovic, May 26 '21 at 13:56

score 0 · Answer 2 · answered Nov 03 '21 at 12:09

I assume you have non-negative data, as you say (in a comment; this should have been in the post itself):

... to derive patterns of food intake and see association of these patterns with health outcomes.

For such data, some variant of multiple correspondence analysis might be better. For a paper on this, see "Use of Multiple Correspondence Analysis and Cluster Analysis to Study Dietary Behaviour: Food Consumption Questionnaire in the SU.VI.MAX. Cohort".

Maybe also look into other ideas, such as non-negative matrix factorization (NMF)?
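
To make the non-negative matrix factorization suggestion concrete, here is a minimal sketch using the classic multiplicative updates for the squared-error objective. It is only an illustration under stated assumptions: the intake matrix is synthetic, the function name `nmf`, the rank of 2, and the iteration count are hypothetical choices, and a library implementation (for example `sklearn.decomposition.NMF`) would usually be preferable in practice.

```python
import numpy as np


def nmf(v, rank, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative matrix v (n x p) as w @ h with w, h >= 0,
    using multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(seed)
    n, p = v.shape
    w = rng.random((n, rank)) + eps
    h = rng.random((rank, p)) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative factor,
        # so w and h stay non-negative throughout.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h


# Hypothetical intake data: 200 people, 6 food items, 2 dietary patterns.
# A person who does not follow a pattern has true zeros for its items.
rng = np.random.default_rng(1)
follows = rng.random((200, 2)) > 0.5
true_w = rng.random((200, 2)) * follows
true_h = np.array([[1, 1, 1, 0, 0, 0],
                   [0, 0, 0, 1, 1, 1]], dtype=float)
intake = true_w @ true_h

w, h = nmf(intake, rank=2)
rel_error = np.linalg.norm(intake - w @ h) / np.linalg.norm(intake)

print("fraction of zeros:", round(float((intake == 0).mean()), 2))
print("relative reconstruction error:", rel_error)
# Rows of h are the patterns (item weights); rows of w are each person's scores.
print("patterns (rows scaled to max 1):")
print(np.round(h / h.max(axis=1, keepdims=True), 2))
```

Because both factors are constrained to be non-negative, a zero intake is represented as the absence of a pattern rather than as a negative deviation from a mean, which is one reason this family of methods is sometimes considered for data like these.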
