Test repeated (within-subjects) observations of multinomial categorical data?

Question

Over the course of 30 days I have asked 47 people (24 from group A and 23 from group B) which of four foods they prefer, making up a total of 1410 observations:

     choice
group apple orange pizza beer
    A   340     63   216  101
    B   424     65   125   76

Because I have asked the same person multiple times, the observations (within each group) are not independent and I cannot use a chi-squared test to compare the distributions.

What I want to know is: Which foods are chosen significantly more often by one group than the other? My hypothesis is that group A prefers pizza and beer, while group B prefers fruits. I assume that the preference does not change over (such a short) time and am not interested in the longitudinal aspect of the survey.

What test can I use?

Attempt at a solution:

Basically, the repeated measures (of the same person) are something like repeatedly measuring the length of a stick to obtain a more accurate measurement and average out measurement errors. I therefore thought that for each person I might calculate the percentage of each answer category. Thus, 100% of one person's answers would divide into, for example, 40% apple answers, 30% orange, 20% pizza, and 10% beer. Represented as proportions (that sum to 1 for each person), I would then have data like this:

person group apple orange pizza beer
     1     A   0.4    0.3   0.2  0.1
     2     B   ...

In this way I would have "deleted" the within-person interdependence and would then perform a t-test on the resulting two numeric vectors.

But I am unable to judge whether this is a valid procedure for the kind of data I have. Also, I would prefer to use a published and reviewed test, if such a one exists.
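The per-person proportion idea can be sketched in R roughly as follows. This is only a minimal sketch under assumptions: the sample data contain no person identifier, so the `long` data frame (columns `person`, `group`, `choice`) is simulated and hypothetical, and Welch's t-test with a Holm correction is just one possible choice, not an established procedure for this design.

```r
# Hypothetical long-format data: one row per person and day.
# The sample data have no person id, so the answers are simulated here.
set.seed(1)
food    <- c("apple", "orange", "pizza", "beer")
persons <- data.frame(person = 1:47,
                      group  = rep(c("A", "B"), c(24, 23)))
long <- persons[rep(seq_len(nrow(persons)), each = 30), ]
long$choice <- factor(sample(food, nrow(long), replace = TRUE), levels = food)

# One row per person: share of the 30 answers falling in each category
counts <- table(long$person, long$choice)
props  <- unclass(prop.table(counts, margin = 1))  # plain 47 x 4 matrix

# Rows of 'props' are in the same order as 'persons'
group <- persons$group

# One two-sample (Welch) t-test per food on the person-level proportions
pvals <- sapply(food, function(f) t.test(props[, f] ~ group)$p.value)

# Adjust for testing four categories
p.adjust(pvals, method = "holm")
```

Because the four proportions of a person sum to 1, the four tests are not independent of each other, which is one reason to treat this as an exploratory check rather than a definitive test.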

Sample data:

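food <- c("apple", "orange", "pizza", "beer")

# One row per answer: 720 answers from group A, 690 from group B
dat <- data.frame(
  group  = rep(c("A", "B"), c(720, 690)),
  choice = c(
    rep(food, c(340, 63, 216, 101)),  # counts for group A
    rep(food, c(424, 65, 125, 76))    # counts for group B
  )
)

# Note: table() orders the choice columns alphabetically
tab <- table(dat)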

I think longitudinal categorical analysis fits your need. I'm not very familiar with the subject myself, so I'm sorry I can't help more. [CatGEE](https://faculty.washington.edu/heagerty/Courses/VA-longitudinal/private/CatGEE.pdf) — Jirapat Samranvedhya, Aug 22 '17 at 16:52
Thank you, @JirapatSamranvedhya, my question was phrased somewhat unclearly. I am not interested in the longitudinal factor. I edited my question. — , Aug 22 '17 at 18:22
The number of observations in the table adds up to 1410. Where did the extra 120 observations go? — rinspy, Aug 25 '17 at 07:43
So, you have factor Group, random factor Respondent (nested in it) and repeated measures (the source of error, not a treatment factor). Why not perform multinomial regression with dependent Food and predictors random Respondent nested in fixed Group? RM-measures - you unwrap them into "long format", i.e. different rows of data - don't enter as factor anyhow: it will remain the error term. — ttnphns, Aug 25 '17 at 08:33
@ttnphns Could you write out the R code of that model for me? I'm not sure how to do "nested in". — , Aug 25 '17 at 10:00
https://stats.stackexchange.com/questions/298019/test-individual-categories-in-a-contingency-table-for-significance/299299#299299 can these be referred to each other or merged? — Sextus Empiricus, Aug 31 '17 at 15:22
@MartijnWeterings I wouldn't know. This question asks about testing the whole distribution, the other question asks about testing the individual categories. If I knew the answer to both questions, I would know if they are the same question or different questions. — , Aug 31 '17 at 16:04
Merging might indeed be wrong since they are different questions with different answers. Yet, it might be useful if you refer to related questions such that it is clear in the new question what you tried before, and in the old question what you are doing now. — Sextus Empiricus, Aug 31 '17 at 16:14

score 2 · Accepted Answer · answered Aug 30 '17 at 11:59

Edit: I just saw that this is what @ttnphns proposed in the comments.

I think a good approach for your data would be multinomial regression. You can find details on how to do that in this question:

Can I use glm algorithms to do a multinomial logistic regression?

Basically, you'll use a mixed-effects GLM with a Poisson distribution and a log link. The subject will be the random effect.

vkehayas


Thank you, @vkehayas. Could you maybe write out the model (or the R code) for me? I'm not very good with constructing regression models. – Aug 31 '17 at 16:03
I'm afraid I am not versed in R; so far I've used Matlab for this type of problem. What I suggested should look something like this: 'food ~ 1 + group + (1|subject)' as the model, with family(poisson). I think @ttnphns suggested using this model: 'food ~ 1 + group + (group|subject)'. Good luck! – vkehayas Aug 31 '17 at 20:16
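
Since R code was requested, here is one possible reading of the suggested model, written as a hedged sketch rather than a confirmed solution. It assumes the lme4 package (not mentioned in the thread) and uses simulated, hypothetical per-person counts, because the sample data carry no person id. With a Poisson family the response has to be a count, so the data are arranged as one row per person and food, and the group difference for each food shows up in the `food:group` interaction terms.

```r
# Hypothetical data: 47 persons, each with 30 answers spread over four foods
set.seed(1)
food    <- c("apple", "orange", "pizza", "beer")
persons <- data.frame(person = factor(1:47),
                      group  = factor(rep(c("A", "B"), c(24, 23))))

# Simulated counts, one row per person and one column per food
counts <- t(rmultinom(47, size = 30, prob = rep(0.25, 4)))

# Long format: one row per person x food, with the count as response
long <- data.frame(
  person = rep(persons$person, times = 4),
  group  = rep(persons$group, times = 4),
  food   = factor(rep(food, each = 47), levels = food),
  count  = as.vector(counts)
)

# Poisson mixed model with a random intercept per person (needs lme4)
if (requireNamespace("lme4", quietly = TRUE)) {
  fit <- lme4::glmer(count ~ food * group + (1 | person),
                     data = long, family = poisson)
  # The food:groupB rows compare the groups for each food
  # relative to the reference food ("apple")
  print(summary(fit)$coefficients)
}
```

Because every person contributes exactly 30 answers, the random intercept alone may be estimated close to zero; capturing person-specific food preferences would need a richer random-effects structure than this sketch uses.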

score 1 · Answer 2 · edited Aug 25 '17 at 10:33

I'd suggest Cochran-Mantel-Haenszel chi-squared test (mantelhaen.test function from base R).

In your data, you have 30 strata (one for each day), and CMH allows you to take into account possible variability (among strata) in the group-choice relationship.

See examples on ?mantelhaen.test.
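
A minimal sketch of how the call might look with days as strata. The data are simulated and hypothetical (the sample data in the question contain no day column), and the variable names are my own. Note that the same persons appear in every stratum here.

```r
# Hypothetical data: 47 persons (24 in A, 23 in B) answering on 30 days
set.seed(1)
food <- c("apple", "orange", "pizza", "beer")
long <- data.frame(
  group  = rep(rep(c("A", "B"), c(24, 23)), each = 30),
  day    = rep(1:30, times = 47),
  choice = factor(sample(food, 47 * 30, replace = TRUE), levels = food)
)

# Three-way table: group x choice x day (2 x 4 x 30)
tab3 <- table(group = long$group, choice = long$choice, day = long$day)

# Generalized CMH test of group vs. choice, stratified by day
res <- mantelhaen.test(tab3)
res
```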

answered Aug 25 '17 at 10:03

Łukasz Deryło


Nice. I'll look into that test over the weekend, but at a quick glance it looks like what I need. Thank you. – Aug 25 '17 at 10:10
I can't see how that can help. CMH is for independent strata. Further, it is for a dichotomous response. – ttnphns Aug 25 '17 at 13:00
`mantelhaen.test` can be used with nxn table (so called generalized CMH), see last example on help page. I'm not sure about independence of strata, I'll try to check it. – Łukasz Deryło Aug 25 '17 at 15:05
In the book from which the examples on the help page for the R implementation of the CMH test are taken, that test is only used for independent strata, as ttnphns has commented. There are procedures in that book for repeated observations of the same subjects, and they all restructure the data in the way I have explained in my "answer". – Aug 27 '17 at 14:14
OK, seems that CMH is not perfect here. – Łukasz Deryło Aug 27 '17 at 15:36

score 0 · Answer 3 · 2017-08-27T16:10:48.807

This is merely an extended comment to the answer by Łukasz Deryło.

In the R help for ?mantelhaen.test, suggested by Łukasz Deryło in his answer, a reference is given to Agresti, A. (2002). Categorical Data Analysis (2nd ed.). Hoboken: Wiley. I went through that book, and from it I took the following solution:

First, we need to represent the data differently. Instead of 30 observations for 47 people:

person day   food
     1   1   beer
     1   2 orange
     1   3   beer
   ...

we can think of each person replying with a certain response pattern made up from 30 elements. This pattern is the sequence of foods chosen on the thirty consecutive days. For one person, this response pattern would look like this:

                day
person group      1    2    3     4    5 ...    30
     1     A orange beer beer apple beer ... pizza

In a next step, we list all 4³⁰ possible response patterns and indicate whether or not a participant displayed that pattern. In this representation of the data, each participant will have a single "1" in his or her row (for the pattern he or she displayed) and "0" in all 4³⁰ − 1 other columns. To save space, I represent each food with its first letter (the first sequence of letters represents the pattern "'apple' chosen on every day", the second represents the pattern "'apple' chosen on the first 29 days, 'orange' chosen on the last day"):

             pattern
person group aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaao ...
     1     A                              1                              0
     2     B                              0                              0

There are more than one quintillion possible patterns (4³⁰ = 1152921504606846976), so I hope you will forgive me that I only show the first two. But I am sure you get the idea.

Finally, we will calculate the column sums for each group and get a new representation of the data that looks like this:

             pattern
group aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaao ...
    A                              3                              0
    B                              1                              2

This means that three participants from group A have chosen apple on all thirty days, compared to only one person from group B who displayed that response pattern. No one from group A chose apple on the first twenty-nine days and orange on the last day, compared to two people from group B. And so on.
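
The restructuring into response patterns can be sketched in R as follows. This is only an illustration on simulated, hypothetical data (random answers with made-up person and day columns), so the pattern counts it produces do not match the illustrative numbers in the table above.

```r
# Hypothetical data: one row per person and day
set.seed(1)
food  <- c("apple", "orange", "pizza", "beer")
group <- rep(c("A", "B"), c(24, 23))          # group of persons 1..47
long <- data.frame(
  person = rep(1:47, each = 30),
  day    = rep(1:30, times = 47),
  food   = sample(food, 47 * 30, replace = TRUE)
)
long <- long[order(long$person, long$day), ]  # days in order within person

# Collapse each person's 30 answers into one string of first letters
pattern <- tapply(substr(long$food, 1, 1), long$person,
                  paste, collapse = "")

# Count how many persons per group displayed each observed pattern
pattern_tab <- table(group = group, pattern = as.vector(pattern))

# Number of possible patterns (about 1.15e18) vs. patterns actually observed
n_possible <- 4^30
n_observed <- ncol(pattern_tab)
```

Only observed patterns get a column here; with 47 persons at most 47 of the possible columns can be non-empty, which illustrates how sparse the full table would be.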

Given this representation of the data, we can now fit a simple logit model of the form logit[P(Y = 1)] = α + β₁g + β₂p, where g = {A, B} are the groups and p are the 4³⁰ response patterns.

I understand that I would need to survey more people than there are on this planet for that regression analysis to yield meaningful results, which makes this procedure impractical in my situation, but I suspect that theoretically this would be a way to get around the repeated observations from the participants. Agresti gives examples with three repeated measures of a binary answer, which result in eight response patterns, e.g. (from page 487):

I can't understand why you became interested in _response patterns_. You said `does not change over (such a short) time and am not interested in the longitudinal aspect of the survey`; that sounded as if you treat your 30 repeated measures simply as a source of error: you ask the same person 30 times because, asked only once, he might be in a bad mood or drunk that day and give a muddled or false answer. That's how I took your question. — ttnphns, Aug 27 '17 at 14:41
Yes, @ttnphns, but there is a (probably) very high correlation between observations from the same person, so these observations are not independent, and the procedure described above gets rid of that interdependence between observations, by transforming the sequence of observations into response patterns. The whole point of the question is overcoming the fact that there are repeated measures and therefore observations are not independent. If they were, I could use the CMH test. — , Aug 27 '17 at 14:58
