In my data set, the response variables are compositional: probabilities that sum to 1.

Why can't I just analyse this with a linear model or, alternatively, with something like a generalized linear model that uses the Dirichlet distribution?

Currently I am being told that I need to study methods specifically designed for compositional data analysis, and that could take a while. Why is it necessary?

Marggie
    Some similar questions have been asked, and there are some good answers: http://stats.stackexchange.com/questions/91960/why-is-it-not-ok-to-do-a-pearson-correlation-on-proportion-data/152678#152678 http://stats.stackexchange.com/questions/68944/analysing-data-measured-as-proportional-composition/68954#68954 http://stats.stackexchange.com/questions/55916/generalized-lm-or-lm-in-ecological-dataset/55919#55919 – kjetil b halvorsen Feb 14 '17 at 21:36

1 Answer

You can't use a traditional linear model because it assumes that the values of the response variables are independent. The components of a compositional variable, however, are by definition not independent. E.g., if you have three categories, the third is 1 minus the sum of the first two.
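
To make the dependence concrete, here is a minimal sketch in plain Python. The data are simulated (hypothetical three-part compositions built by rescaling independent positive draws), and the helper names `closure` and `covariance` are introduced only for this illustration. It checks that the third part is fully determined by the other two, and that every row of the sample covariance matrix sums to zero, which forces some negative covariance between the parts.

```python
import random

def closure(parts):
    """Rescale positive parts so they sum to 1 (a composition)."""
    total = sum(parts)
    return [p / total for p in parts]

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

rng = random.Random(0)
# Hypothetical data: 500 three-part compositions from independent positive draws.
data = [closure([rng.gammavariate(2.0, 1.0) for _ in range(3)]) for _ in range(500)]
cols = list(zip(*data))

# The third part is determined by the first two (up to rounding error).
max_gap = max(abs(c - (1.0 - a - b)) for a, b, c in data)

# Because the parts sum to a constant, each row of the covariance matrix
# sums to (numerically) zero, so the matrix is singular.
cov = [[covariance(ci, cj) for cj in cols] for ci in cols]
row_sums = [sum(row) for row in cov]

print("largest gap between part 3 and 1 - part 1 - part 2:", max_gap)
print("row sums of the covariance matrix:", row_sums)
```

Since each variance on the diagonal is positive, a zero row sum means at least one off-diagonal covariance in that row is negative, even though the raw draws were generated independently.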

There are situations where it can make sense to treat the response categories as if they were multinomial or Dirichlet. However, when you do this you are making some strong assumptions regarding the relationship between the categories. Consider variables that denote types of food eaten: carrots, bananas, chicken, and beef. A vegetarian will always have 0s for chicken and beef, which is inconsistent with the assumptions of both the multinomial and the Dirichlet.
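
One way to see the problem with zeros for the Dirichlet case is to evaluate the log-density directly. The following is only a sketch: the function `dirichlet_logpdf`, the parameter vector `alpha`, and the two diets are made up for illustration, and the code assumes every `alpha` is greater than 1 so that the behaviour at zero is unambiguous.

```python
import math

def dirichlet_logpdf(x, alpha):
    """Log-density of Dirichlet(alpha) at composition x (sketch, all alpha > 1)."""
    if any(a <= 1.0 for a in alpha):
        raise ValueError("this sketch assumes every alpha > 1")
    if any(xi < 0 for xi in x) or abs(sum(x) - 1.0) > 1e-9:
        raise ValueError("x must be non-negative and sum to 1")
    # Log of the normalising constant: lgamma(sum alpha) - sum lgamma(alpha_i).
    total = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    for xi, a in zip(x, alpha):
        if xi == 0.0:
            # (a - 1) * log(0) is -inf when a > 1: the density is zero here.
            return -math.inf
        total += (a - 1.0) * math.log(xi)
    return total

alpha = [2.0, 2.0, 2.0, 2.0]          # hypothetical parameters
omnivore = [0.3, 0.2, 0.3, 0.2]       # carrots, bananas, chicken, beef
vegetarian = [0.6, 0.4, 0.0, 0.0]     # exact zeros for chicken and beef

omnivore_ll = dirichlet_logpdf(omnivore, alpha)
vegetarian_ll = dirichlet_logpdf(vegetarian, alpha)
print(omnivore_ll, vegetarian_ll)
```

Under these assumptions a single observation with an exact zero drives the whole log-likelihood to minus infinity, so a Dirichlet fit cannot accommodate the vegetarian without first altering the data or the model.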

I'd suggest that you start with an exploratory analysis of just your response variables to get a basic understanding of their distribution, and then try to find a compositional model that makes sense. If you are really lucky, you will find that you can collapse your compositional variables into two categories, and then you can use a logistic model, which will make life much, much easier.
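
As a rough illustration of the collapsing step, the sketch below sums four hypothetical food shares into two groups and computes the log-odds of one group. The grouping, the numbers, and the helper `amalgamate` are assumptions made for this example, not part of any particular package, and this is only one possible reading of "collapse into 2".

```python
import math

def amalgamate(composition, groups):
    """Sum the parts named in each group; the result still sums to 1."""
    return {name: sum(composition[part] for part in parts)
            for name, parts in groups.items()}

# Hypothetical diet shares for one person (illustrative values only).
diet = {"carrots": 0.25, "bananas": 0.25, "chicken": 0.3, "beef": 0.2}
groups = {"plant": ["carrots", "bananas"], "animal": ["chicken", "beef"]}

two_part = amalgamate(diet, groups)

# With two parts there is one free quantity, so a single log-odds value
# carries all the information in the collapsed composition.
log_odds = math.log(two_part["plant"] / two_part["animal"])
print(two_part, log_odds)
```

The log-odds is unconstrained on the real line, which is what makes logistic-type models convenient here. Note that it is undefined when either collapsed share is exactly 0, so the zero problem described above does not disappear automatically.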

Tim
  • http://stats.stackexchange.com/questions/55916/generalized-lm-or-lm-in-ecological-dataset/55919#55919 Here, Glen seems to suggest that with k > 2 components we can use the Dirichlet distribution...? – Marggie Feb 14 '17 at 21:45
  • The Dirichlet distribution works if the data are actually Dirichlet distributed. If not, then it's a bad option. That's what Tim means when he says that treating the responses as if they arise from a Dirichlet distribution makes some strong assumptions (that may or may not be met in the process that you are studying). – Jacob Socolar Feb 14 '17 at 21:53