Taking the correlation of vectors with a constrained, fixed sum (say, a simplex, where sum is always 1) will induce spurious negative correlations, since increasing one element always means decreasing some combination of the others. Is there a canonical approach to dealing with this? What options are available for investigating the correlation of this type of data?
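To make the effect concrete, here is a minimal simulation sketch. It assumes three independent gamma-distributed quantities (an arbitrary choice for illustration, not something taken from real data) and compares their Pearson correlations before and after each row is rescaled to sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Independent positive "raw" quantities: the columns are uncorrelated by construction.
# The gamma distribution and its parameters are arbitrary illustrative choices.
raw = rng.gamma(shape=2.0, scale=1.0, size=(10_000, 3))

# Closure: divide each row by its total so that every row sums to 1.
comp = raw / raw.sum(axis=1, keepdims=True)

raw_corr = np.corrcoef(raw, rowvar=False)
comp_corr = np.corrcoef(comp, rowvar=False)

print("correlations before closure:\n", raw_corr.round(2))
print("correlations after closure:\n", comp_corr.round(2))
```

The off-diagonal entries are close to zero before closure and clearly negative afterwards, even though nothing in the underlying quantities is related.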

kjetil b halvorsen
Empiromancer

2 Answers

As @kjetil points out in a comment, this is compositional data. A quick Google Scholar search for that term shows that this is still an area of active research, and the best alternative to Pearson's correlation for compositional data remains an open question. Here are a few papers on the subject; note that all of them are quite recent:

Many methods rely on a log-ratio transformation of the data. The second reference above lays out the basics of one such approach, along with an R package to do the heavy lifting for you. However, log-ratio transformations also have drawbacks, such as an inability to handle sparse data well, since the logarithm of a zero component is undefined.
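The sparsity problem can be seen in a couple of lines. This is only a sketch: the composition is made up, and the pseudocount `eps` is a hypothetical value standing in for whatever zero-replacement strategy you can justify for your data.

```python
import numpy as np

# A made-up composition with one zero part.
x = np.array([0.5, 0.5, 0.0])

# Any log-ratio involving the zero part is not finite.
with np.errstate(divide="ignore"):
    log_ratio = np.log(x[2] / x[0])  # log(0) evaluates to -inf

# One possible workaround (an assumption, not a recommendation):
# replace zeros with a small pseudocount and re-close to sum 1.
eps = 1e-6  # hypothetical value; results can be sensitive to this choice
x_fixed = np.where(x == 0, eps, x)
x_fixed = x_fixed / x_fixed.sum()
log_ratio_fixed = np.log(x_fixed[2] / x_fixed[0])

print(log_ratio, log_ratio_fixed)
```

The replaced value is finite but depends directly on `eps`, which is why sparse data needs extra justification under log-ratio approaches.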

In summary, there's no canonical approach, and no matter what approach you choose you'll have to do some work to justify that choice and show that it behaves well for your data.

Empiromancer

Many standard multivariate techniques become available for compositional data after applying a log-ratio transformation (see, e.g., this answer to my own question), for example the centered log-ratio (clr) transformation: $$ \operatorname{clr}(x) = \operatorname{clr}(x_1, \ldots, x_k) = \left(\log\frac{x_1}{g(x)}, \ldots, \log\frac{x_k}{g(x)}\right),\qquad g(x)=\left(\prod_{j=1}^{k} x_j\right)^{1/k}, $$ where $g(x)$ is the geometric mean of the components.
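A minimal NumPy sketch of the clr transformation follows. It uses the identity $\log(x_i / g(x)) = \log x_i - \frac{1}{k}\sum_j \log x_j$ and assumes every component is strictly positive; the function name `clr` and the example compositions are illustrative only.

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of compositions stored in the last axis.

    Assumes all components are strictly positive.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    # Subtracting the mean of the logs is the same as dividing by the geometric mean.
    return log_x - log_x.mean(axis=-1, keepdims=True)

# Two made-up compositions, one per row.
x = np.array([[0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3]])
z = clr(x)
print(z)
print(z.sum(axis=1))  # each row sums to zero, up to rounding
```

Standard tools such as `np.corrcoef(z, rowvar=False)` can then be applied to the transformed rows, keeping in mind that the clr components of each row always sum to zero.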

There are several log-ratio transformations available, and the choice depends on the technique you have in mind: some transformations yield well-behaved distances suitable for clustering but singular correlation matrices, and vice versa.
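The trade-off can be checked numerically. The sketch below compares the clr with the additive log-ratio (alr), which takes logs of ratios to one reference part; the alr is brought in here only as a familiar contrast, and the Dirichlet-simulated data and the rank tolerance are assumptions made for illustration.

```python
import numpy as np

def clr(x):
    # Centered log-ratio: log of each part relative to the row's geometric mean.
    log_x = np.log(x)
    return log_x - log_x.mean(axis=1, keepdims=True)

def alr(x):
    # Additive log-ratio: log of each part relative to the last part (k - 1 columns).
    return np.log(x[:, :-1] / x[:, [-1]])

rng = np.random.default_rng(1)
k = 4
# Simulated compositions; the Dirichlet parameters are arbitrary.
x = rng.dirichlet(alpha=np.full(k, 2.0), size=500)

clr_cov = np.cov(clr(x), rowvar=False)  # k x k
alr_cov = np.cov(alr(x), rowvar=False)  # (k - 1) x (k - 1)

# Explicit tolerance so that rounding noise is not counted as rank.
clr_rank = np.linalg.matrix_rank(clr_cov, tol=1e-10)
alr_rank = np.linalg.matrix_rank(alr_cov, tol=1e-10)

print("clr covariance:", clr_cov.shape, "rank", clr_rank)
print("alr covariance:", alr_cov.shape, "rank", alr_rank)
```

Because the clr components of each row sum to zero, the $k \times k$ clr covariance has rank $k - 1$ and cannot be inverted directly. The alr covariance is full rank, but the alr is commonly described as not preserving distances between compositions, which matters for clustering.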

I also found this article useful.

Roger Vadim