I'm looking to run a linear mixed effect model using lme4, where my dependent variable one_syllable_words / total_words_generated is a proportion and my random effect (1 | participant_ID) reflects the longitudinal nature of the design. Independent, fixed effect variables of interest include age, group, timepoint, and interactions between them.

I've come across two main ways to deal with the proportional nature of the DV:

  1. Standard logistic regression / binomial GLM

    In my scenario, I envision the lme4 equation looking like this:

    glmer(one_syllable_words / total_words_generated ~
            age + group + timepoint +
            age:timepoint + age:group + timepoint:group +
            (1 | participant_ID),
          family = "binomial",
          weights = total_words_generated,
          data = mydat)
    
  2. Beta regression

    I would apply a transformation to my DV, (DV * (n - 1) + 0.5) / n, so that it cannot equal 0 or 1. (There are a few instances where it equals zero and no instances where it equals one.)
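
For concreteness, here is a minimal base-R sketch of the transformation described in option 2. The helper name `squeeze_proportion` and the example values are made up for illustration, and it assumes that `n` is the number of observations in the data set, which the formula above does not state explicitly.

```r
# Hypothetical helper: move proportions from [0, 1] into the open interval (0, 1).
# Assumption: n is the number of observations in the data set.
squeeze_proportion <- function(p, n = length(p)) {
  stopifnot(all(p >= 0 & p <= 1))
  (p * (n - 1) + 0.5) / n
}

dv <- c(0, 0.25, 0.5, 0.9)        # made-up proportions, including an exact zero
dv_adj <- squeeze_proportion(dv)  # every value is now strictly between 0 and 1
print(dv_adj)
```

Note that the adjusted values depend on `n`, so the same raw proportion is shifted by a different amount in data sets of different size.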

I'm unclear whether logistic regression or beta regression is preferred in this example. My DV isn't a clear-cut case of successes and failures (unless we stretch the definition of "success"), so I'm worried logistic regression might not be appropriate. However, I'm having trouble getting a firm grasp on beta regression & all it entails. If beta regression is preferred:

  1. Why is it preferred?
  2. What is it doing "behind the scenes" to the data?
  3. How can it be applied in R?
amoeba
Sarah Smith
  • +1. My very related Q from a couple of days ago: http://stats.stackexchange.com/questions/233366/ (see my own answer there). – amoeba Sep 14 '16 at 13:52
  • See also http://stats.stackexchange.com/questions/189115 and http://stats.stackexchange.com/questions/87956 for how to do the binomial model (which I think is the correct approach in your case) and related issues. Both threads are linked to in my answer that I mentioned above - I tried to make that one comprehensive. – amoeba Sep 14 '16 at 14:08
  • Thank you so much for your responses, extremely helpful! Could you provide a bit more information about why you think the binomial model is the correct approach in my case? I also notice that you opted for a slightly different model that used glmerControl and the optimizer = "bobyqa." What is the rationale behind this? – Sarah Smith Sep 14 '16 at 14:22
  • (1) I think that the binomial model is the correct approach because your data is binomial: if there is some underlying probability of a word being one syllable then the number of one syllable words is binomially distributed. This number can be zero, and this cannot be modeled via beta (and your suggested transformation is just an arbitrary hack). (2) This is the same model! Just a different numerical optimizer that fits this model. I had to use it because the `glmer` function was not converging without it (I have no idea why), and I read that `bobyqa` is more reliable than the default. – amoeba Sep 14 '16 at 14:27
  • So in your case, if option #1 had converged, you would not have opted for option #4? And the fact that option #1 did not converge is likely related to overdispersion? – Sarah Smith Sep 14 '16 at 14:32
  • No-no, these are different issues. I could not use my option #1 because I did not have fractions; my situation is different from yours. You have fractions and hence you have a binomial situation. It is still a good idea to model overdispersion though, see BenBolker's answer linked in my second comment here. You should add one more random term into your `glmer` call, that's it. – amoeba Sep 14 '16 at 14:35
  • Coming around to understanding, really appreciate your help! When you mention adding one more random term to glmer -- would this random term be (1 | index) such that every observation has a unique index? Trying to get at what Ben described when he wrote "if you have a single observation per location then your random effect will handle this." As it stands, I have 3 observations per subject. Alternatively, could I set family = quasibinomial instead of binomial? And should I make these changes to safeguard against overdispersion, or only implement if I've verified overdispersion is a problem? – Sarah Smith Sep 14 '16 at 14:54
  • Yes, I meant `(1|index)`. With only three observations per subject I am not sure this term will make a lot of difference, but you can still include it and see how it behaves. I think the idea is to include it as a safeguard. Check the variance of this random effect in the fitted model; if it's zero or close to zero then your random subject variance is already enough. Re quasibinomial - you cannot use it in `lme4`, it does not work. So it's not an option. – amoeba Sep 14 '16 at 15:00
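
Putting the suggestions from these comments together, a sketch of the binomial model with an observation-level random effect and the `bobyqa` optimizer might look like the following. This is only an illustration under stated assumptions: the data are simulated stand-ins for `mydat` (40 participants with 3 timepoints each, all values made up), `age_z` is a standardized copy of age introduced here to help the optimizer, and `index` is the per-observation identifier discussed above.

```r
library(lme4)

set.seed(1)

# Simulated stand-in for `mydat` (assumption: 40 participants x 3 timepoints).
n_id  <- 40
mydat <- expand.grid(participant_ID = factor(seq_len(n_id)), timepoint = 0:2)
id    <- as.integer(mydat$participant_ID)

age_by_id   <- round(runif(n_id, 20, 60))
mydat$age   <- age_by_id[id]
mydat$age_z <- as.numeric(scale(mydat$age))   # standardized age (illustrative choice)
mydat$group <- factor(ifelse(id %% 2 == 0, "a", "b"))

# Counts: total words per observation, and how many of them have one syllable.
mydat$total_words_generated <- rpois(nrow(mydat), 30) + 5
subj_eff <- rnorm(n_id, 0, 0.5)               # participant-level variation
obs_eff  <- rnorm(nrow(mydat), 0, 0.3)        # extra noise, i.e. overdispersion
eta <- -0.5 + 0.2 * mydat$age_z + 0.3 * (mydat$group == "b") +
  0.1 * mydat$timepoint + subj_eff[id] + obs_eff
mydat$one_syllable_words <- rbinom(nrow(mydat),
                                   size = mydat$total_words_generated,
                                   prob = plogis(eta))

# One unique level per row: the observation-level random effect.
mydat$index <- factor(seq_len(nrow(mydat)))

fit <- glmer(one_syllable_words / total_words_generated ~
               age_z + group + timepoint +
               age_z:timepoint + age_z:group + timepoint:group +
               (1 | participant_ID) + (1 | index),
             family = binomial,
             weights = total_words_generated,
             data = mydat,
             control = glmerControl(optimizer = "bobyqa"))

# Inspect the random-effect variances; a near-zero `index` variance would
# suggest the participant term already absorbs the extra variation.
vc <- as.data.frame(VarCorr(fit))
print(vc[, c("grp", "vcov")])
```

With real data, the simulation part would be dropped and only the `index` column, the `glmer` call, and the variance check would remain.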

0 Answers