Questions tagged [compositional-data]

Refers to variables representing fractions of a total, i.e. all lying in the interval $[0,1]$ and necessarily summing to one. Analysis of such data is often called compositional data analysis.

Compositional data pertain to the relative proportions of a whole. For example,

  • Each data point may correspond to a rock composed of three different minerals; a rock of which 10% is the first mineral, 30% is the second, and the remaining 60% is the third would correspond to the triple [0.1, 0.3, 0.6]; a data set would contain one such triple for each rock in a sample of rocks.
    (Wikipedia)

This is subtly different from binomial or multinomial data, which can also yield proportions that lie exclusively in $[0, 1]$ (or could be represented as counts out of a total), but where the proportions come from discrete events that could have been one category or another.

The analysis of such data requires special methods, mostly based on log ratios, such as the additive (alr), centered (clr), and isometric (ilr) log-ratio transformations.
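As a rough illustration of a log-ratio method, the sketch below applies the centered log-ratio (clr) transformation to the rock triple from the example above. The helper names `closure` and `clr` are chosen here for illustration and are not taken from any particular package. The sketch assumes all parts are strictly positive; zeros would need separate handling before taking logarithms.

```python
import numpy as np

def closure(x):
    """Rescale the parts so that each composition sums to one."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centered log-ratio: log of each part minus the mean log of all parts.

    Assumes every part is strictly positive (log(0) is undefined).
    """
    log_x = np.log(closure(x))
    return log_x - log_x.mean(axis=-1, keepdims=True)

# The rock from the example: 10%, 30%, and 60% of three minerals.
rock = [0.1, 0.3, 0.6]
z = clr(rock)
print(z)
```

The clr coordinates sum to zero, and they are unchanged if the same rock is recorded as percentages (`[10, 30, 60]`), because only the ratios between parts enter the transformation.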

Additional resources can be found at http://www.compositionaldata.com/

137 questions
27
votes
3 answers

How to perform isometric log-ratio transformation

I have data on movement behaviours (time spent sleeping, sedentary, and doing physical activity) that sums to approximately 24 (as in hours per day). I want to create a variable that captures the relative time spent in each of these behaviours -…
20
votes
3 answers

Do I need to drop variables that are correlated/collinear before running kmeans?

I am running kmeans to identify clusters of customers. I have approximately 100 variables to identify clusters. Each of these variables represent the % of spend by a customer on a category. So, if I have 100 categories, I have these 100 variables…
17
votes
1 answer

What are some distributions over the probability simplex?

Let $\Delta_{K}$ be the probability simplex of dimension $K-1$, i.e. $x \in \Delta_{K}$ is such that $x_i \ge 0$ and $\sum_i x_i = 1$. Which distributions over $\Delta_{K}$ are frequently used (or well-known, or defined in the past)? Clearly,…
16
votes
2 answers

Can I use the CLR (centered log-ratio transformation) to prepare data for PCA?

I am using a script. It is for core records. I have a dataframe which shows the different elemental compositions in the columns over a given depth (in the first column). I want to perform a PCA with it and I am confused about the standardization…
15
votes
2 answers

Clustering of very skewed count data: any suggestions on how to go about it (transformations, etc.)?

Basic problem Here is my basic problem: I am trying to cluster a dataset containing some very skewed variables with counts. The variables contain many zeros and are therefore not very informative for my clustering procedure - which is likely to be…
13
votes
4 answers

What test to compare community composition?

Hope this newbie question is the right question for this site: Suppose I would like to compare the composition of ecological communities at two sites A, B. I know all three sites have dogs, cats, cows, and birds, so I sample their abundances at each…
11
votes
4 answers

Why is it not OK to do a Pearson correlation on proportion data?

An online module I am studying states that one should never use Pearson correlation with proportion data. Why not? Or, if it is sometimes OK or always OK, why?
11
votes
3 answers

Why is the isometric log-ratio transformation preferred over the additive (alr) or centered (clr) transformation with compositional data?

I'm doing linear regression on compositional data using log-ratio transformation with census data. The IVs are compositional (percents summing to 100). The DV is non-compositional and continuous. The alr and clr results are more easily interpreted.…
9
votes
2 answers

What are the differences between Dirichlet regression and log-ratio analysis?

Compositional data can be analyzed by either Dirichlet regression or using log-ratio analysis as pioneered by John Aitchison. My questions are What are the main differences in assumptions between these two models? When should you prefer one above…
8
votes
1 answer

Problems with time series prediction

I have a question about modeling time series in R. My data consist of the following matrix: 1 0.03333333 0.01111111 0.9555556 2 0.03810624 0.02309469 0.9387991 3 0.00000000 0.03846154 0.9615385 4 0.03776683 0.03119869 0.9310345 5 0.06606607…
7
votes
1 answer

How to use the isometric log-ratio ilr() from the "compositions" package

I have an environmental dataset, where observations do not sum up to 1. I suspect that data are a subcomposition, meaning that not all elements have been measured and that is why observations do not sum up to a constant. I would like to apply…
6
votes
1 answer

Multivariate data analysis of compositional data

Suppose I have a multivariate, compositional dataset that depicts the concentration of different elements. However, the data are not available on a single scale; i.e., some are of form 0.00x while others are integers. Should I apply any kind of…
6
votes
1 answer

Peanut butter jars full of river mud and bacteria?

I'm an environmental scientist looking into dynamics of bacteria growth in river bed sediments. I collected lots of data, and used regression for most of the comparisons, but one (the most important) is giving me fits: I'm trying to figure out if…
5
votes
1 answer

Possible classification techniques to use when each feature is a probability distribution

I am working with some data where the features have a temporal aspect (e.g. how often does a feature occur between $t_{begin}$ and $t_{end}$). I am trying to build a binary classifier for this data. The problem, however, is that each feature is a…
5
votes
2 answers

Predicting proportions with Machine Learning

I am working on a machine learning problem where I have to predict a set of $N$ numbers (proportions) for each data point, all of them summing to one. One toy example to illustrate my problem would be predicting at a daily level the percentage of…