Deal with percentage data

Question

For instance, I have such data:

PatientID    marker1 marker2 marker3 marker4
1             10%      25%    25%     40%
2             20%      15%    10%     45%
3             15%      20%    25%     40%
4             25%      25%    20%     30%
...

Each patient's tissue is tested for four different markers. The data point for each marker is its percentage among the four markers, so the four values for each patient add up to 100%.

Given such data, I might compare the four markers using ANOVA, or pairwise t-tests combined with a Bonferroni correction. But how should I deal with the dependence between observations, given that the four percentages add up to 100% for each individual?

I have not stated a goal or specific research question for these data. That is not because I don't have one, but because I don't want a fixed question (for instance, looking only at the differences between marker1 and marker2) to limit how the data might be explored. Any good idea for analyzing such data is welcome.

One usually uses ANOVA and t-tests when comparing a few groups of cases. It seems that here you don't have groups, but four continuous variables. I believe it is rather a linear model that you should use. Could you clarify this, please? — PtrZlnk, Apr 14 '17 at 13:00
"Deal with" is rather vague. Some clarity on what your hypothesis is would be helpful. — Ashe, Apr 14 '17 at 13:16
You should not ask "I have this data, what can I do with it?" Instead, you should tell us what you want to do and what your specific question is. — Peter Flom, Apr 14 '17 at 13:29
@PeterFlom I phrased my question this way only because I imagine having such data in a future project and needing to work out a story from them in order to complete a manuscript, if possible. So I just want to play with the data, to think of good ideas for how to use them and how to analyze them correctly. Is my point clear? Of course, thank you for your comment and your time. — juanli, Apr 14 '17 at 15:01
@PeterFlom Now I am wondering whether it would be fun, and a good brainstorming exercise as well, to make up a good story about a set of data (probably some strange data). It's my daily research routine. Very enjoyable. — juanli, Apr 14 '17 at 15:10
@PtrZlnk Well, I know the data in this format might not be directly usable as input to a statistical analysis. Sure, I would need a separate group variable, together with the continuous variables, if I wanted to conduct an ANOVA test, for instance. But my point is about how to handle such data points (referred to as compositional data below). By the way, are you sure about using ANOVA for compositional data? — juanli, Apr 18 '17 at 12:52
In exploring the topic I came across some interesting stuff: this 95-page-long ["lecture" by Pawlowsky-Glahn](http://dugi-doc.udg.edu/bitstream/handle/10256/297/CoDa-book.pdf?sequence=1); the [CoDaWeb](http://www.compositionaldata.com) of the University of Girona (Catalonia); and their very friendly [free app](http://www.compositionaldata.com/codapack.php). — Antoni Parellada, Apr 19 '17 at 18:31
Also, I found very interesting [this post on CV](https://stats.stackexchange.com/a/244446/67822) by @marc1s. — Antoni Parellada, Apr 19 '17 at 18:32
@AntoniParellada Many thanks. I find it very necessary and also enjoyable to spare some time for this topic. — juanli, Apr 19 '17 at 19:07
@Amy Is there any way you can define a bit more concretely what you have in mind? For example, I am having trouble applying the concept of groups to your dataset - in [this paper](http://www.idescat.cat/sort/sort392/39.2.4.martin-etal.pdf), the Girona researchers compare the % of time spent in different daily activities across different **groups** (gender, day of the week, socioeconomic status)... — Antoni Parellada, Apr 19 '17 at 19:52
@AntoniParellada An earlier post of mine involving compositional data. Now I wonder whether it is all right just to perform regression analysis (even penalized regression) after removing one column, as suggested there. https://stats.stackexchange.com/questions/244045/sum-of-all-covariables-value-per-patient-is-1 — juanli, Apr 20 '17 at 11:58
@AntoniParellada In that old post, my question was about regressing the y value (patients' weight change) on the percentages of different bacteria (relative percentages that sum to 100%), while also including other patient characteristics such as age and gender. Now I need to rethink that case. — juanli, Apr 20 '17 at 12:03
@Amy I have made some progress in this regard, and was thinking about posting an example with just hands-on R-coded analysis, paralleling your linked question. It may be too basic for you, and I won't get around to it until later. But I thank you for your question - this was an area I hadn't explored before. — Antoni Parellada, Apr 20 '17 at 12:05
@AntoniParellada I came across such data once when my supervisor asked me for help analyzing them. Posting an illustrated example with detailed R code is a good proposal; it will be more attractive and useful. — juanli, Apr 20 '17 at 12:34

score 4 · Answer 1 · answered Apr 14 '17 at 13:13

This kind of data is known as compositional data, and you might find this interesting summary of transformation techniques helpful. You designate one of the markers as the baseline; it won't be used directly in any analysis, though you can back out its results in the end.
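The answer does not name a specific transformation. One standard choice that matches the "designate a baseline" description is the additive log-ratio (ALR) transform, sketched below as an assumption about what was meant. The function names `alr` and `alr_inverse` are illustrative, and marker4 is arbitrarily taken as the baseline.

```python
import math

def alr(parts):
    """Additive log-ratio transform with the LAST part as baseline.

    Returns len(parts) - 1 unconstrained coordinates. All parts must be
    strictly positive; zeros would need special handling first.
    """
    base = parts[-1]
    return [math.log(p / base) for p in parts[:-1]]

def alr_inverse(coords):
    """Map ALR coordinates back to proportions that sum to 1.

    The baseline part is recovered as the last element.
    """
    expd = [math.exp(c) for c in coords] + [1.0]
    total = sum(expd)
    return [e / total for e in expd]

# Patient 1 from the question, as proportions (marker4 is the baseline).
patient1 = [0.10, 0.25, 0.25, 0.40]

coords = alr(patient1)           # three log-ratios, free of the sum constraint
recovered = alr_inverse(coords)  # all four proportions, baseline included

print(coords)
print(recovered)
```

The three log-ratio coordinates can be analysed with ordinary methods, and `alr_inverse` shows how the baseline marker's share is "backed out" afterwards. Results expressed in ALR coordinates depend on which marker is chosen as the baseline, so that choice should be reported.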

Wayne


I will definitely look into this so-called "compositional data". Many thanks. – juanli Apr 14 '17 at 15:07
