
Background
Compositional data ($x_i>0$, $\sum_i x_i=c$) are usually analyzed using some kind of log-ratio transformation (alr/clr/ilr), to account naturally for the fact that, in the presence of the sum constraint, only the relative scale of the data values is of importance (see here, here and also this answer). A particular obstacle to this approach is posed by data containing zeros, which require imputation using one of the available imputation strategies - see here. A less inhibiting difficulty is having to work on a simplex, with its non-intuitive geometry (although I suppose the intuition comes with practice).
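For concreteness, the clr transform mentioned above is only a few lines of Python (a sketch I am adding for illustration; the zero problem is visible immediately, since `np.log` of a zero component is $-\infty$):

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    logx = np.log(x)
    return logx - logx.mean()

# Scale invariance: clr depends only on ratios, so rescaling x changes nothing.
x = np.array([0.2, 0.3, 0.5])
print(np.allclose(clr(x), clr(10 * x)))  # True
```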

Possible alternative
A possible alternative is performing a square root transformation, $y_i=\sqrt{x_i}$, which confines the data points to a sphere, $\sum_i y_i^2=c$. One can then use the cosine distance (or the geodesic distance), which permits application of other statistical techniques. The advantages are:

  • spherical geometry (particularly the distances in this geometry) is more intuitive
  • there is no need for imputation of zeros
  • since $x_i=y_i^2\geq 0$ automatically, it may facilitate application of methods/algorithms (statistical or numerical) that do not preserve the positivity of $y_i$.

The particular disadvantage is that the scaling nature of the data is not taken into account directly.
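A minimal sketch of the proposal (the function names are mine, for illustration). Note that zero components pose no problem, and the resulting distance is bounded above by $\pi/2$ for non-negative vectors:

```python
import numpy as np

def sqrt_transform(x):
    """Map a composition onto the unit sphere via y_i = sqrt(x_i / sum(x))."""
    return np.sqrt(x / x.sum())

def geodesic_distance(x1, x2):
    """Great-circle (geodesic) distance between two compositions on the sphere."""
    cos_angle = np.clip(np.dot(sqrt_transform(x1), sqrt_transform(x2)), -1.0, 1.0)
    return np.arccos(cos_angle)

# Zeros are handled without imputation, unlike with log-ratio transforms.
a = np.array([0.5, 0.5, 0.0])
b = np.array([0.0, 0.5, 0.5])
print(geodesic_distance(a, b))  # ≈ 1.047, i.e. pi/3
```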

Question
What are the particular pitfalls that I may be missing here (which might explain why this approach does not seem to be broadly used)? Are there alternative ways of dealing with compositional data containing zeros?

Update
Another particularity of this approach is that the distance is bounded from above - this could limit the application of some statistical techniques where the variables should be able to span the whole real axis.

Example
The particular application that I have in mind is the gene count data originating from metagenomic analyses of bacterial species. These typically come in the form of count tables, giving the number of times a gene was detected in a sample (more precisely, the number of sequencing reads that mapped onto this gene). The sum over all the genes is referred to as sequencing depth. The zeros may appear either because the corresponding genes are absent from the sample (i.e., the species carrying these genes are absent) or because the genes (species) are present at a very low concentration, undetected at the given sequencing depth.

Update 2
Table 1 of this article presents a range of distance measures used for the analysis of microbial data. In particular, it provides, alongside the Aitchison distance, hypersphere-based measures such as the Bhattacharyya and Hellinger distances.
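For reference, both hypersphere-based measures are a few lines each (a sketch of my own for two probability-normalized compositions; note the identity $H^2 = 1 - BC$ linking the Hellinger distance to the Bhattacharyya coefficient):

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya distance: -ln of the Bhattacharyya coefficient sum(sqrt(p*q))."""
    bc = np.sum(np.sqrt(p * q))
    return -np.log(bc)

def hellinger(p, q):
    """Hellinger distance: scaled Euclidean distance between sqrt-compositions."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

# Zero components are unproblematic for both measures.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.5, 0.25])
print(hellinger(p, q), bhattacharyya(p, q))
```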

Roger Vadim
  • Michael Greenacre's book https://www.routledge.com/Compositional-Data-Analysis-in-Practice/Greenacre/p/book/9781138316430 includes gentle propaganda for correspondence analysis as a useful method that doesn't assume the absence of zeros. This doesn't solve all problems, but it can be useful. – Nick Cox Apr 12 '21 at 07:49

1 Answer

A few thoughts.

  1. The log transforms you cited are really log transforms of ratios, which you see all through statistics, e.g. logit $=\log(\frac{p}{1-p})$. The log of a ratio linearizes the computation because $\log(\frac{a}{b}) = \log(a) - \log(b)$. This has nice computational properties. If it were only a Box-Cox transform of the raw data (not ratios), then a $\mathrm{sqrt}$ could be just as good as a $\log$, or even better in the presence of zeros.
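To make the linearization concrete, a tiny sketch with generic values (my own illustration, not from the original post):

```python
import numpy as np

# The log of a ratio turns multiplicative structure into additive structure:
a, b = 2.0, 5.0
print(np.isclose(np.log(a / b), np.log(a) - np.log(b)))  # True

# The logit is exactly such a log-ratio, of p to 1 - p:
p = 0.8
logit = np.log(p / (1 - p))
```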

  2. The log transform takes numbers and ratios on $(0, \infty)$ to $(-\infty, \infty)$. Again, this has nice statistical properties in that the results are on the same support as the Normal distribution, and it is easier for solvers to find solutions to unconstrained optimizations. The $\mathrm{sqrt}$ keeps the ratios on $(0, \infty)$.

  3. The problem of zeros in compositional data sometimes involves a Minimum Detectable Level. You will see statisticians overwrite recorded values below the minimum detectable level as "0" or even "MDL" in their analysis. Neither is correct. The best method is to treat those values as censored so that any likelihood calculation incorporates the uncertainty in the actual experimental procedure.
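     A minimal sketch of the censored-likelihood idea on synthetic data (the Normal model, the MDL value, and all names here are my own assumptions for illustration): observed values contribute the density to the likelihood, while values below the MDL contribute the CDF evaluated at the MDL.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Hypothetical setup: latent measurements ~ Normal(mu, sigma); anything below
# the minimum detectable level (MDL) is reported only as "below MDL".
rng = np.random.default_rng(0)
true_mu, true_sigma, mdl = 1.0, 1.0, 0.5
data = rng.normal(true_mu, true_sigma, 500)
observed = data[data >= mdl]
n_censored = np.sum(data < mdl)

def neg_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize by log(sigma) to keep sigma > 0
    ll = stats.norm.logpdf(observed, mu, sigma).sum()      # detected values
    ll += n_censored * stats.norm.logcdf(mdl, mu, sigma)   # censored values
    return -ll

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
```

     The key point is the `logcdf` term: replacing the censored observations by 0 (or by the MDL itself) would bias both estimates, whereas the censored likelihood recovers them.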

R Carnell
  • Thank you for the answer. Could you provide a reference for your point #3 (treating the zeros as censored)? – Roger Vadim Apr 12 '21 at 07:14
  • A google search on censored compositional data got me [here](https://www.researchgate.net/publication/322164865_Compositional_data_analysis_and_the_zero_problem_interval_censoring_approach) and [here](https://www.researchgate.net/publication/322164870_Modelling_censored_compositional_data_in_the_correct_censored_space) and my favorite censoring [textbook](https://www.amazon.com/Survival-Analysis-Techniques-Truncated-Statistics/dp/038795399X) – R Carnell Apr 12 '21 at 13:13
  • Thanks for giving it a time! – Roger Vadim Apr 12 '21 at 14:04