3

I have a data set (many continuous predictors and a single continuous response variable) that contains many zeros. I first used PCA and found the results very helpful. I initially thought that the PCA scores remedy the zero problem by replacing the many zero values with non-zero numbers, but then I wondered whether these scores are still meaningful, since they are computed from a data set containing many zeros. PCA scores are orthogonal, they condense many variables into a few, and they remove problems such as collinearity. Would it nonetheless be better to try some form of Poisson regression or a zero-inflated model?

stats_noob

2 Answers

3

Zero inflation is also a common problem in community ecology data. When ordinating sites by their species communities, a PCA tends to just cluster all of the sites near the origin. We therefore typically use distance-based ordinations: we calculate similarity/dissimilarity metrics between sites based on species composition, and then perform something like Principal Coordinates Analysis (PCoA) or Non-Metric Multidimensional Scaling (NMDS).
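To make the dissimilarity step concrete, here is a minimal pure-Python sketch of one common choice for abundance data, the Bray-Curtis dissimilarity. The site-by-species counts are made up for illustration, and treating two all-zero sites as identical is a convention I am assuming here, not something the metric itself defines.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    # Assumption: two all-zero sites are treated as identical (0.0),
    # because the ratio is undefined when both sites are empty.
    return num / den if den > 0 else 0.0


def dissimilarity_matrix(sites):
    """Pairwise Bray-Curtis dissimilarities for a list of site vectors."""
    n = len(sites)
    return [[bray_curtis(sites[i], sites[j]) for j in range(n)] for i in range(n)]


# Hypothetical site-by-species counts (rows = sites, columns = species).
sites = [
    [10, 0, 0, 3],
    [8, 0, 1, 0],
    [0, 5, 0, 0],
]

D = dissimilarity_matrix(sites)
for row in D:
    print([round(x, 3) for x in row])
```

Species that are absent from both sites contribute nothing to the numerator or the denominator, so shared zeros do not make two sites look more alike. That is the main reason this kind of metric copes better with zero-heavy data than Euclidean distance does.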

If your goal is dimensionality reduction, researchers often do use the scores along the PCoA or NMDS axes as new variables too. Here's a recent paper that did this using the Bray-Curtis dissimilarity metric.
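As a rough sketch of how axis scores can be obtained from a dissimilarity matrix, the following Python/NumPy code implements classical PCoA (double-centering followed by an eigendecomposition). The function name `pcoa`, the `n_axes` parameter, and the example matrix are all hypothetical. Clipping negative eigenvalues to zero is a simplification I am assuming; non-Euclidean dissimilarities such as Bray-Curtis can produce negative eigenvalues, and dedicated packages offer proper corrections.

```python
import numpy as np


def pcoa(D, n_axes=2):
    """Classical PCoA: return (scores, eigenvalues) for a square dissimilarity matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared dissimilarities (Gower centering).
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # B is symmetric, so eigh applies; sort eigenvalues in descending order.
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumption: negative eigenvalues are simply clipped to zero.
    kept = np.clip(eigvals[:n_axes], 0.0, None)
    scores = eigvecs[:, :n_axes] * np.sqrt(kept)
    return scores, eigvals


# Hypothetical dissimilarities between three sites.
D = [
    [0.00, 0.27, 1.00],
    [0.27, 0.00, 1.00],
    [1.00, 1.00, 0.00],
]

scores, eigvals = pcoa(D, n_axes=2)
print(scores)  # one row per site; columns could serve as new predictors
```

Each row of `scores` gives a site's coordinates on the retained axes, and those columns are what would be carried into a downstream regression in place of the original zero-heavy variables.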

These methods are pretty straightforward to implement in R - in particular, check out the 'vegan' package.

mtreg
1

In single-cell RNA-seq, methods have been developed to deal with zero-inflated data, specifically for PCA-style dimensionality reduction.

You could potentially use one such method, ZIFA (Zero-Inflated Factor Analysis):

http://biorxiv.org/content/biorxiv/early/2015/06/14/019141.full.pdf

Standard PCA on zero-inflated data will be very sub-optimal, since it does not model the excess zeros.

David shaw