Modelling Data with Many Zeros - Principal Component Analysis vs Zero Inflated Models

Question

I have a data set (many continuous predictors and a single continuous response variable) with many zeros. I first used PCA and found the results very helpful. I initially thought that the PCA scores remedy the zero problem by replacing the many zero values with non-zero numbers, but then I wondered whether these scores are meaningful at all, since they are derived from a data set containing many zeros. PCA scores are orthogonal, they condense many variables into a few components, and they remove problems such as collinearity. But would it be better to try some form of Poisson regression or a zero-inflated model?

I would look at both Zero-Inflated and Hurdle models. – Frank H., Oct 27 '15 at 15:28
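To make the hurdle idea concrete for a continuous response, here is a minimal two-part sketch in Python with NumPy. It is an illustration under stated assumptions, not a recommended final model: the zero/non-zero split is modelled with a logistic regression, and the positive values are assumed to be roughly log-normal so that ordinary least squares on log(y) is reasonable. The function names (`fit_logistic`, `fit_two_part`) and the simulated data are hypothetical and not taken from any package.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25):
    """Logistic regression by Newton-Raphson; X must already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # fitted P(z = 1)
        w = p * (1.0 - p)                     # Newton weights
        grad = X.T @ (z - p)
        hess = X.T @ (X * w[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def fit_two_part(X, y):
    """Part 1: P(y > 0 | x) by logistic regression.
    Part 2: OLS of log(y) on x, using only rows with y > 0 (log-normal assumption)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    nonzero = y > 0
    beta_zero = fit_logistic(X1, nonzero.astype(float))
    beta_pos, *_ = np.linalg.lstsq(X1[nonzero], np.log(y[nonzero]), rcond=None)
    return beta_zero, beta_pos

# Simulated data (assumption: one predictor, known true coefficients).
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 1))
p_nonzero = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x[:, 0])))
is_pos = rng.random(n) < p_nonzero
y = np.where(is_pos, np.exp(1.0 + 0.5 * x[:, 0] + 0.3 * rng.normal(size=n)), 0.0)

beta_zero, beta_pos = fit_two_part(x, y)
print("zero part (intercept, slope):    ", beta_zero)
print("positive part (intercept, slope):", beta_pos)
```

The two parts are fitted separately, so each predictor gets one coefficient for whether the response is non-zero and another for how large it is when it is non-zero. A zero-inflated model differs in that zeros can also arise from the count or continuous component itself.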

score 3 · Answer 1 · answered Oct 27 '15 at 17:14

Zero-inflated data is a common problem in community ecology as well. When ordinating sites by their species communities, a PCA tends to cluster all of the sites near the origin. We therefore typically use distance-based ordinations: we calculate a similarity/dissimilarity metric between sites based on species composition, and then perform something like Principal Coordinates Analysis (PCoA) or Non-Metric Multidimensional Scaling (NMDS).

If you are aiming for dimensionality reduction, researchers often do use the scores along the PCoA or NMDS axes as new variables too. Here's a recent paper that did this using the Bray-Curtis dissimilarity metric.

These methods are pretty straightforward to implement in R - in particular, check out the 'vegan' package.
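For readers not working in R, the distance-based ordination described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the 'vegan' implementation: the helper names (`bray_curtis`, `pcoa`) and the toy abundance matrix are hypothetical. Negative eigenvalues, which can occur because Bray-Curtis is not a Euclidean distance, are simply clipped to zero here, which is one of several possible treatments.

```python
import numpy as np

def bray_curtis(X):
    """Pairwise Bray-Curtis dissimilarity between the rows of a non-negative matrix."""
    X = np.asarray(X, dtype=float)
    num = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    den = (X[:, None, :] + X[None, :, :]).sum(axis=2)
    # Two all-zero rows give 0/0; their dissimilarity is set to 0 here (an assumption).
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

def pcoa(D, n_axes=2):
    """Classical PCoA: double-centre the squared distances, then eigendecompose."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gower-centred matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_axes]   # largest eigenvalues first
    # Clip negative eigenvalues that arise from non-Euclidean distances.
    pos = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(pos), eigvals[order]

# Toy site-by-species abundance matrix with many zeros (made-up numbers).
abundance = np.array([
    [5, 0, 0, 2, 0],
    [4, 1, 0, 3, 0],
    [0, 0, 7, 0, 1],
    [0, 0, 6, 0, 2],
    [0, 3, 0, 0, 0],
])

D = bray_curtis(abundance)
scores, eigvals = pcoa(D)
print(np.round(D, 3))
print(scores)
```

The columns of `scores` are the site coordinates along the first PCoA axes, and these are what would be carried forward as new variables in a later regression.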

score 1 · Answer 2 · answered Oct 27 '15 at 15:38

In the world of single-cell RNA-seq, methods have been developed to deal with zero-inflated data, specifically for performing PCA.

You could potentially use this method (ZIFA):

http://biorxiv.org/content/biorxiv/early/2015/06/14/019141.full.pdf

Standard PCA on zero-inflated data will be very sub-optimal.
