How to test the effect of experimental condition on a "select all that apply" variable?

Question

People are randomly assigned to participate in multicultural training (treatment) or a control activity. Weeks later, they are given a list of behaviors that could benefit multiple groups: Black Americans, Asian Americans, gays and lesbians, immigrants, and Muslims. They are asked to select all activities in which they would be interested in participating.

What is the best way to analyze the effect of condition on this dependent variable? I can think of a few ways:

1. Run five logistic regression models. Using this approach, I would treat each behavior on its own as a binary variable (1 = yes, want to participate; 0 = no, don't want to participate). This is straightforward, but there are two immediate problems. First, the five outcome variables are related to one another, yet this analysis treats them as completely separate. Second, the Type I error rate is inflated by multiple comparisons. I could adjust the p-values, but I generally find these methods unsatisfying, and it is difficult to choose among the many possible adjustments. This approach also doesn't allow the models to "share information" with one another, when they should, because participation in the different behaviors is likely to be correlated.

2. Run a Poisson or negative binomial model. This involves creating a count of how many behaviors the participant would like to partake in, which can range from 0 (selected none) to 5 (selected all of them). I do not wish to use this approach, because I want results at the level of each specific behavior, not just how many behaviors were selected overall. I also do not wish to equate selecting behaviors benefitting Black and Asian Americans (count = 2) with selecting behaviors benefitting Muslims and immigrants (also count = 2).

3. Fit a multilevel model. This involves nesting all five responses within an individual. I define a dummy-coded variable at Level 1 (within-person) denoting the target group (e.g., Black, Asian, Muslim, etc.) and another dummy-coded variable at Level 2 (between-person) denoting the participant's condition. I define a random intercept and a random slope for target group within person. The model looks like:

$Y_{ij} = \beta_{0j} + \beta_{1j}X_{ij} + \epsilon_{ij}$

$\beta_{0j} = \gamma_{00} + \gamma_{01}Z_j + u_{0j}$

$\beta_{1j} = \gamma_{10} + \gamma_{11}Z_j + u_{1j}$

Here $X$ would actually be four dummy-coded variables (I have collapsed them into one term for brevity), and $Z$ represents assignment to condition. As an lme4 formula, this is:

glmer(participate ~ group * condition + (1 + group | id), data, family = binomial)

However, I am having trouble getting this model to converge. With about 300 participants and 5 observations per person, I do not believe N is an issue. I'm not sure whether some peculiarity of my data is causing the convergence problems, or whether there is a general issue with this model that I am overlooking. I suspect the model has a hard time converging on person-specific estimates of the effect on a $\{0, 1\}$ outcome (this might be less of an issue with a continuous outcome modeled with a Gaussian family and identity link).

4. Some type of extended chi-square table approach? There is an R package called MRCV, but it seems to focus on examining relationships between multiple "select all that apply" variables, not on the experimental effect of one variable on a multiple response categorical variable.
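To make options 1 and 3 concrete, here is a minimal sketch on simulated placeholder data. The column names (`id`, `condition`, `black`, `asian`, `gay_lesbian`, `immigrant`, `muslim`) are hypothetical, the responses are random noise rather than real results, and Holm is just one of the available p-value adjustments. The commented-out `glmer` call with a random intercept only is a possible simplification to try, not a recommendation drawn from the question.

```r
# Simulated placeholder data: 300 participants, 5 binary "would participate" items.
# All column names and the 0.4 endorsement rate are assumptions for illustration.
set.seed(1)
n <- 300
groups <- c("black", "asian", "gay_lesbian", "immigrant", "muslim")
wide <- data.frame(
  id = seq_len(n),
  condition = rep(c("control", "treatment"), each = n / 2)
)
for (g in groups) {
  wide[[g]] <- rbinom(n, size = 1, prob = 0.4)
}

# Option 1: one logistic regression per behavior, then adjust the five p-values.
p_raw <- sapply(groups, function(g) {
  fit <- glm(reformulate("condition", response = g),
             data = wide, family = binomial)
  coef(summary(fit))["conditiontreatment", "Pr(>|z|)"]
})
p_holm <- p.adjust(p_raw, method = "holm")

# Option 3: stack the five items so each row is one person-by-group response,
# which is the layout the lme4 formula above expects.
long <- data.frame(
  id = rep(wide$id, times = length(groups)),
  condition = rep(wide$condition, times = length(groups)),
  group = factor(rep(groups, each = n), levels = groups),
  participate = unlist(wide[groups], use.names = FALSE)
)

# With lme4 installed, a random-intercept-only variant could be tried:
# lme4::glmer(participate ~ group * condition + (1 | id),
#             data = long, family = binomial)
```

The long data frame has 1,500 rows (300 people by 5 target groups), so `group * condition` yields one condition effect per target group within a single model.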

What is the best (with some justification) way to analyze the effect of an experimental condition on a "select all that apply" multiple response categorical variable?

I have seen similar questions asked on CrossValidated, but I have not found the answers to be very helpful (see How to analyse a "Check all that Apply" question, How to test for group differences in a 'select all that apply' question).

score 1 · Answer 1 · answered May 04 '18 at 03:12

Interesting problem. If I understand this correctly, participants can choose any of the following:

- a single behaviour;
- two behaviours;
- three behaviours;
- four behaviours;
- five behaviours.

If my counting is correct, participants can choose one behaviour in 5 different ways, 2 behaviours in 10 different ways, 3 behaviours in 10 different ways, 4 behaviours in 5 different ways and 5 behaviours in a single way. That makes a total of 31 possibilities. It's possible that some of these possibilities are not represented in the data, so you may have fewer than 31, but even so, that's a lot of possibilities!
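The count can be checked with a few lines of base R. This is only a sketch, and the column names `behaviour1` to `behaviour5` are placeholders. If participants are also allowed to select nothing (the question mentions a count of 0), the empty pattern adds one more, giving 32.

```r
k <- 5

# Number of ways to pick exactly 1, 2, ..., 5 of the 5 behaviours.
ways <- choose(k, 1:k)   # 5 10 10 5 1
total <- sum(ways)       # 31, which equals 2^5 - 1

# Enumerate every 0/1 selection pattern and drop the all-zero row.
patterns <- expand.grid(rep(list(0:1), k))
names(patterns) <- paste0("behaviour", 1:k)
patterns <- patterns[rowSums(patterns) > 0, ]
nrow(patterns)           # 31
```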

If you didn't have so many possibilities, I guess you could have defined your outcome variable as "possibility selected by participant" (e.g., participant # 1 selected behaviours 1 and 2), in which case a multinomial logistic regression might have helped.

So it looks like you might have no option but to collapse information across all these possibilities.

Maybe there is a natural ranking of these behaviours whereby behaviour 5 is the best and behaviour 1 is the worst. Then you could assign ranks to each behaviour (e.g., behaviour 5 receives 5 points and behaviour 1 receives 1 point) and define your outcome variable as the sum of the ranks corresponding to the selected behaviours. Then, when modelling this summated rank, you can control for the number of behaviours each participant selected by including that number as a covariate in your model.
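As a rough sketch of how the summated rank could be computed, assuming the ranks 1 to 5 from the example above and three made-up participants (the `selected` matrix is toy data, not real responses):

```r
# Assumed ranks: behaviour 1 = 1 point, ..., behaviour 5 = 5 points.
ranks <- 1:5

# Toy 0/1 selections, one row per participant, one column per behaviour.
selected <- rbind(
  c(1, 1, 0, 0, 0),   # picked behaviours 1 and 2
  c(0, 0, 0, 1, 1),   # picked behaviours 4 and 5
  c(1, 1, 1, 1, 1)    # picked all five
)

rank_sum   <- as.vector(selected %*% ranks)  # 3 9 15
n_selected <- rowSums(selected)              # 2 2 5

# A model along these lines could then include n_selected as a covariate, e.g.
# lm(rank_sum ~ condition + n_selected, data = ...)
```

Note that the first two participants both selected two behaviours but receive different rank sums, which is the distinction a plain count would lose.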

Or maybe you can single out two of the behaviours you think are most important (e.g., behaviours 1 and 2) and then define your outcome variable to have the categories: "behaviour 1 (by itself or in conjunction with any or all of the behaviours 3, 4 or 5 but not in conjunction with behaviour 2)", "behaviour 2 (by itself or in conjunction with any or all of the behaviours 3, 4 or 5 but not in conjunction with behaviour 1)", "behaviours 1 and 2 together (by themselves or in conjunction with any or all of the behaviours 3, 4 or 5)", "all other sets of behaviours". Then you have a more manageable number of mutually exclusive categories which you can model via multinomial regression.
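A minimal sketch of that recoding, assuming 0/1 indicator vectors for the two focal behaviours. The function name `collapse_selection` and the category labels are made up for illustration:

```r
# Collapse selections into four mutually exclusive categories based on
# two focal behaviours; behaviours 3 to 5 do not affect the category.
collapse_selection <- function(b1, b2) {
  ifelse(b1 == 1 & b2 == 0, "behaviour 1 without 2",
  ifelse(b1 == 0 & b2 == 1, "behaviour 2 without 1",
  ifelse(b1 == 1 & b2 == 1, "behaviours 1 and 2",
         "all other sets")))
}

# Toy input: four participants' answers for behaviours 1 and 2.
outcome <- factor(collapse_selection(b1 = c(1, 0, 1, 0),
                                     b2 = c(0, 1, 1, 0)))

# The resulting factor could then go into a multinomial model, for example
# nnet::multinom(outcome ~ condition, data = ...)
```

Anyone who selected neither focal behaviour, whatever else they chose, falls into "all other sets".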

I guess you have to define your outcome variable in a way that reflects your research question while acknowledging the challenges involved in the data. I just wanted to share some ideas in case they might prompt you to think differently about your problem.
