I have a problem where I am trying to group observations (most likely using k-means or a similar unsupervised learning tool), where each observation consists of n variables that sum to one. In other words, I am grouping observations based on the probabilities of each of n potential states or outcomes for each observation.
For example: suppose we are testing the manufacturing of dice, and we take a sample of 1000 dice. Each die would be one observation, and we would record V1 = the percentage of the time that a one appeared when we rolled that particular die, V2 = the percentage for two, etc. We would then cluster the dice based on the percentage of the time each die landed on a 1, 2, 3, etc. (to see if, for example, one of the die-making machines was improperly calibrated). While we could simply inspect the percentages, the hope is that clustering these observations will reveal underlying trends (colors, materials, machines, etc.) that would not be apparent without unsupervised learning techniques.
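For concreteness, here is a minimal sketch of the kind of data I have in mind (assuming Python with numpy and scikit-learn; the dice sample is simulated, since I can't share the real data, and the "biased machine" group is purely hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulate 1000 dice: each row is one observation of 6 face
# frequencies, and each row sums to 1 (the compositional constraint).
fair = rng.dirichlet(np.ones(6) * 50, size=900)          # roughly fair dice
biased = rng.dirichlet([80, 50, 50, 50, 50, 50], size=100)  # hypothetical machine favoring ones
X = np.vstack([fair, biased])

assert np.allclose(X.sum(axis=1), 1.0)  # every observation sums to one

# Naive approach: k-means directly on the raw proportions.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

This runs, but my concern is whether applying k-means directly to proportions like this is appropriate, given the sum-to-one constraint on each row.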
I know this is a bit of a strange problem, but I'm sure some literature exists for it. What is this type of problem called? Can you point me toward some work that has been done on this type of problem?