Log-transformation of Compositional Data

Question

I am working with high-dimensional compositional data.

Each sample is a point on the simplex:

$$ {S}^D=\left\{\mathbf{x}=[x_1,x_2,\dots,x_D]\in\mathbb{R}^D \,\left|\, x_i>0,i=1,2,\dots,D; \sum_{i=1}^D x_i=1 \right. \right\} $$

To embed the high-dimensional data in two or three dimensions for visualization, I use methods that rely on Euclidean distances, for example t-SNE.

To maintain distances with respect to the Aitchison geometry, I apply the centered log-ratio (CLR) transformation before the dimensionality reduction:

$$ \operatorname{clr}(x) = \left[ \log \frac{x_1}{g(x)}, \dots, \log \frac{x_D}{g(x)} \right] $$

where $ g(x) $ is the geometric mean of the sample.
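
As a minimal sketch of the transformation (assuming NumPy and strictly positive components; the function name `clr` and the example vector are illustrative only), the log of the geometric mean is just the mean of the logs, so the CLR amounts to centering the log-transformed sample. The sketch returns all $D$ components, as in the standard definition:

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of one composition, or of each row of a 2-D array."""
    x = np.asarray(x, dtype=float)
    log_x = np.log(x)  # requires x_i > 0
    # log(x_i / g(x)) = log(x_i) - mean(log(x)), because log g(x) is the mean of the logs
    return log_x - log_x.mean(axis=-1, keepdims=True)

x = np.array([0.2, 0.3, 0.5])  # illustrative composition, sums to 1
z = clr(x)
print(z, z.sum())  # the CLR components sum to zero up to rounding
```

Because the geometric mean scales with the data, `clr(100 * x)` gives the same result as `clr(x)`, whereas `np.log(100 * x)` differs from `np.log(x)` by a constant shift in every component.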

The $\operatorname{clr}$ transformation noticeably improves the visualization and preserves the natural patterns in the data (measured by the tightness of previously known clusters).

However, I get very similar improvements by simply applying a plain $\log$ transformation to the data:

$$ \log(x) = \left[ \log x_1, \dots, \log x_D \right] $$

The $\log$ transformation captures much of the essence of the $\operatorname{clr}$, but I want to prove that $\operatorname{clr}$ is the right way to go when trying to preserve $Euclidean$ distances in the data.

To test this, I looked at the two-part case, with two points on the simplex:

$$ A = \left[0.1, 0.9 \right], C = \left[0.9, 0.1 \right] $$

To move from $A$ to $C$ along the simplex, I have to pass through the point $ B = \left[0.5, 0.5 \right] $.

In $\operatorname{clr}$ space, Euclidean distances are additive along this path:

$$ d(\operatorname{clr}(A), \operatorname{clr}(B)) + d(\operatorname{clr}(B), \operatorname{clr}(C)) = d(\operatorname{clr}(A), \operatorname{clr}(C)) $$

In $\log$ space, however, the inequality is strict:

$$ d(\log(A), \log(B)) + d(\log(B), \log(C)) > d(\log(A), \log(C)) $$

This indicates that the $\log$ transformation bends the path between compositions, so that $B$ no longer lies on the straight line between $A$ and $C$, which might create problems.
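
The two-part check above can be reproduced numerically. This is only a sketch assuming NumPy; the helper names `clr` and `dist` are chosen here for illustration:

```python
import numpy as np

def clr(x):
    """Centered log-ratio: log-transform, then subtract the mean of the logs."""
    log_x = np.log(np.asarray(x, dtype=float))
    return log_x - log_x.mean(axis=-1, keepdims=True)

def dist(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

A = np.array([0.1, 0.9])
B = np.array([0.5, 0.5])
C = np.array([0.9, 0.1])

# Path length A -> B -> C versus the direct distance A -> C, in each space
clr_path = dist(clr(A), clr(B)) + dist(clr(B), clr(C))
clr_direct = dist(clr(A), clr(C))
log_path = dist(np.log(A), np.log(B)) + dist(np.log(B), np.log(C))
log_direct = dist(np.log(A), np.log(C))

print(f"clr: path {clr_path:.4f}, direct {clr_direct:.4f}")
print(f"log: path {log_path:.4f}, direct {log_direct:.4f}")
```

In this example the path and direct distances agree in $\operatorname{clr}$ space (about 3.11 each), while in $\log$ space the path is longer (about 3.43 versus 3.11). The direct $A$ to $C$ distance is the same in both spaces here, because the difference $\log(A) - \log(C)$ already has components that sum to zero, so centering does not change it.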

Is there a better way to show that $\operatorname{clr}$, or a similar transformation that maintains the Aitchison geometry, is superior in such cases?

Your question is confusing, because neither the log nor the CLR--nor, indeed, any inherently nonlinear transformation--can preserve the Euclidean metric. — whuber, Mar 19 '19 at 16:47
I would say that one important aspect of the `clr` in contrast to the `log` is that the transformation is scale-invariant, i.e. it is not important if compositions in your sample sum up to 1, to 100 or whatever. — marc1s, Mar 25 '19 at 15:02
