20

I am trying to use lme4::glmer() to fit a binomial generalized mixed model (GLMM) with dependent variable that is not binary, but a continuous variable between zero and one. One can think of this variable as a probability; in fact it is probability as reported by human subjects (in an experiment that I help analyzing). I.e. it's not a "discrete" fraction, but a continuous variable.

My glmer() call doesn't work as expected (see below). Why? What can I do?

Later edit: my answer below is more general than the original version of this question, so I modified the question to be more general too.


More details

Apparently it is possible to use logistic regression not only for binary DV but also for continuous DV between zero and one. Indeed, when I run

glm(reportedProbability ~ a + b + c, myData, family="binomial")

I get a warning message

Warning message:
In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

but a very reasonable fit (all factors are categorical, so I can easily check whether model predictions are close to the across-subjects-means, and they are).

However, what I actually want to use is

glmer(reportedProbability ~ a + b + c + (1 | subject), myData, family="binomial")

It gives me the identical warning, returns a model, but this model is clearly very much off; the estimates of the fixed effects are very far from the glm() ones and from the across-subject-means. (And I need to include glmerControl(optimizer="bobyqa") into the glmer call, otherwise it does not converge at all.)

amoeba
  • 93,463
  • 28
  • 275
  • 317
  • 1
    How about transforming the probabilities first? Can you get something that's closer to normally distributed with say, a logit transformation? Or the arcsin-sqrt? That would be my preference rather than using glmer. Or in your hack solution, you could also try adding a random effect for each observation to account for underdispersion due to your choice of weights. – Aaron left Stack Overflow Sep 05 '16 at 14:27
  • 1
    Thanks. Yes, I can logit the DV and then use Gaussian mixed model (lmer), but this is also a kind of hack, and I've read that it's not recommended. I will try a random effect for each observation! At the moment, I am trying beta mixed model; lme4 cannot handle it, but glmmadmb can. When I run `glmmadmb(reportedProbability ~ a + b + c + (1 | subject), myData, family="beta")`, I get correct fit and reasonable confidence intervals, but a *convergence failed* warning :-/ Trying to figure out how to increase the number of iterations. Beta might work for me because I don't have DV=0 or DV=1 cases. – amoeba Sep 05 '16 at 14:32
  • I don' t know for glmer but for glm this may help: http://stats.stackexchange.com/questions/164120/interesting-logistic-regression-idea-problem-data-not-currently-in-0-1-form/164127#164127: –  Sep 05 '16 at 14:53
  • 1
    @Aaron: I tried adding `+ (1 | rowid)` to my glmer call and this yields stable estimates and stable confidence intervals, independent of my weight choice (I tried 100 and 500). I also tried running lmer on logit(reportedProbability) and I get almost exactly the same thing. So both solutions seem to work well! Beta MM with glmmadmb gives also very close results, but for some reason fails to converge completely and takes forever to run. Consider posting an answer listing these options and explaining a bit the differences and pros/cons! (Confidence intervals that I mention are all Wald.) – amoeba Sep 05 '16 at 14:57
  • @fcop Thanks. The problem is that I don't have the numbers of success/failures for each case; my response variable is not a fraction or a proportion. It's just a probability (confidence) reported by human subject, e.g. a person can report $0.9$ confidence in their choice, and I want to model that. – amoeba Sep 05 '16 at 15:00
  • 1
    And they are absolutely certain about their value like 0.9, or do they also have some ''error margin on it''? Can you assume that confidence reported by different subjects are equally precise? –  Sep 05 '16 at 15:07
  • @fcop (1) Well, they would probably say that they are not "absolutely certain", but in the experiment they are asked to make a binary choice and to report their confidence (as a number); nothing else was recorded, that's the data I have. (2) It would be better not to make this assumption; I am adding a random intercept `(1 | subject)` which is basically allowing different subjects to be under-confident or over-confident. I don't know how to allow the variance for different subjects to be different, so I assume it's the same. – amoeba Sep 05 '16 at 15:11
  • Maybe, assuming that the variance is the same and that $p(1-p)/n$ is the formula for the variance, you may find something for $n$ and then use this to get numbers of successes. If you use weights then they should sum up to your number of observations else it is 'as if' you increase the sample size, hence the impact on standard errors. –  Sep 05 '16 at 15:17
  • So they make a choice and you want to model their confidence in the choice made? You don't want to model the choice using their confidence as a weight? –  Sep 05 '16 at 15:28
  • @fcop I am modeling their choices too (separately), but in this question I am concerned with modeling the confidences. – amoeba Sep 05 '16 at 15:30
  • Only as a comment: You don't mentioned anything about how balanced your dataset is and whether or not you suspect issue with complete separation. If you use Firth logistic regression (`logistf`) do you still experience the same issue? (I tried it with some toy-data response and it seemed fine but clearly your experiment's design/complexity might be much more demanding) – usεr11852 Sep 05 '16 at 22:21
  • @usεr11852: It's not balanced with respect to the levels of factors A, B, and C, if that's what you mean. Some factor combinations have many more trials than some other factor combinations. Complete separation is not an issue. Interesting idea about `logistf`, but it does not support random effects (?), so can't fully help me here in any case. – amoeba Sep 05 '16 at 22:30
  • @amoeba: I mostly mentioned it because you experienced issues with the glm to begin with. – usεr11852 Sep 06 '16 at 20:14

1 Answers1

25

It makes sense to start with a simpler case of no random effects.

There are four ways to deal with continuous zero-to-one response variable that behaves like a fraction or a probability (this is our most canonical/upvoted/viewed thread on this topic, but unfortunately not all four options are discussed there):

  1. If it is a fraction $p=m/n$ of two integers and all $n$s are known, then one can use standard logistic regression, aka binomial GLM. One way to code it in R is (assuming that n is a vector of $N$ values for each data point):

    glm(p ~ a+b+c, myData, family="binomial", weights=n)
    
  2. If $p$ is not a fraction of two integers, then one can use beta regression. This will only work if the observed $p$ is never equal to $0$ or $1$. If it is, then more complicated zero/one-inflated beta models are possible, but this becomes more involved (see this thread).

    betareg(p ~ a+b+c, myData)
    
  3. Logit transform the response and use linear regression. This is usually not advised.

    lm(log(p/(1-p)) ~ a+b+c, myData)
    
  4. Fit a binomial model but then compute standard errors taking over-dispersion into account. The standard errors can be computed in various ways:

    • (a) scaled standard errors via the overdispersion estimate (one, two). This is called "quasi-binomial" GLM.

    • (b) robust standard errors via the sandwich estimator (one, two, three, four). This is called "fractional logit" in econometrics.


    The (a) and (b) are not identical (see this comment, and sections 3.4.1 and 3.4.2 in this book, and this SO post and also this one and this one), but tend to give similar results. Option (a) is implemented in glm as follows:

    glm(p ~ a+b+c, myData, family="quasibinomial")
    

The same four ways are available with random effects.

  1. Using weights argument (one, two):

    glmer(p ~ a+b+c + (1|subject), myData, family="binomial", weights=n)
    

    According to the second link above, it might be a good idea to model overdispersion, see there (and also #4 below).

  2. Using beta mixed model:

    glmmadmb(p ~ a+b+c + (1|subject), myData, family="beta")
    

    or

    glmmTMB(p ~ a+b+c + (1|subject), myData, 
            family=list(family="beta",link="logit"))
    

    If there are exact zeros or ones in the response data, then one can use zero/one-inflated beta model in glmmTMB.

  3. Using logit transform of the response:

    lmer(log(p/(1-p)) ~ a+b+c + (1|subject), myData)
    
  4. Accounting for overdispersion in the binomial model. This uses a different trick: adding a random effect for each data point:

    myData$rowid = as.factor(1:nrow(myData))
    glmer(p ~ a+b+c + (1|subject) + (1|rowid), myData, family="binomial",
          glmerControl(optimizer="bobyqa"))
    

    For some reason this does not work properly as glmer() complains about non-integer p and yields nonsense estimates. A solution that I came up with is to use fake constant weights=k and make sure that p*k is always integer. This requires rounding p but by selecting k that is large enough it should not matter much. The results do not seem to depend on the value of k.

    k = 100
    glmer(round(p*k)/k ~ a+b+c + (1|subject) + (1|rowid), myData, 
          family="binomial", weights=rowid*0+k, glmerControl(optimizer="bobyqa"))
    

    Later update (Jan 2018): This might be an invalid approach. See discussion here. I have to investigate this more.


In my specific case option #1 is not available.

Option #2 is very slow and has issues with converging: glmmadmb takes five-ten minutes to run (and still complains that it did not converge!), whereas lmer works in a split-second and glmer takes a couple of seconds. Update: I tried glmmTMB as suggested in the comments by @BenBolker and it works almost as fast as glmer, without any convergence issues. So this is what I will be using.

Options #3 and #4 yield very similar estimates and very similar Wald confidence intervals (obtained with confint). I am not a big fan of #3 though because it is kind of cheating. And #4 feels somewhat hacky.

Huge thanks to @Aaron who pointed me towards #3 and #4 in his comment.

amoeba
  • 93,463
  • 28
  • 275
  • 317
  • 1
    Nice answer, well explained and connected with the no random effects models. I wouldn't call #3 (the transformation) cheating though, those kinds of transformations are very common in analyses like these. I'd say instead that both #3 and #4 are making assumptions about the relationship about the distribution of the data, and so also about the relationship between the mean and variance, and just because #4 is modeling on the scale that the data was collected on doesn't mean those assumptions are going to be better. – Aaron left Stack Overflow Sep 06 '16 at 17:46
  • 1
    #3 assumes the logit of the probabilities are normal with constant variance, while #4 assumes the variance is proportional to p(1-p). From your description of the fit, these seem to be similar enough to not matter too much. And #3 is almost certainly more standard (depending on your audience) so if the diagnostics are reasonable, that's the one I would prefer. – Aaron left Stack Overflow Sep 06 '16 at 17:48
  • 2
    another possibility is to use [glmmTMB](https://github.com/glmmTMB/glmmTMB); after installing with `devtools::install_github("glmmTMB/glmmTMB",sub="glmmTMB")`, using `glmmTMB(p ~ a+b+c + (1|subject), myData, family=list(family="beta",link="logit"))` should work ... – Ben Bolker Sep 06 '16 at 18:29
  • @BenBolker Thanks! Is there any reason to prefer glmmTMB to glmmADMB (for beta models) or vice versa? Is one of these packages more recent or more actively developed? Apart from that, may I ask what approach among the ones listed in this answer -- gaussian glmm after logit transform, beta glmm, or binomial glmm with (1|rowid) term -- do you find generally preferable? – amoeba Sep 06 '16 at 19:52
  • 1
    I prefer the beta GLMM if feasible - it's the statistical model that's *intended* to measure changes in proportions across covariates/groups. `glmmTMB` is faster and more stable than `glmmADMB`, and under (slightly) more active development, although not as mature. – Ben Bolker Sep 06 '16 at 19:58
  • @BenBolker I installed `glmmTMB` and ran it on my data. It indeed runs relatively fast (not much slower than `glmer`) and does not give me any errors or warnings. However, the resulting fit is very different from the ones I get with `lmer` and `glmer` (whereas those two agree very well). It is a consistent difference in a particular direction: all estimated mean probabilities are smaller (closer to 0.5). I am not sure what diagnostics could tell me which fit is more reasonable; any ideas? – amoeba Sep 07 '16 at 10:50
  • @BenBolker I realized that this discrepancy has nothing to do with GLMM. In a simplest toy example, I have a vector of probabilities `testData`, and `betareg(testData~1)` can estimate the intercept that is quite a bit different from `lm(logit(testData)~1)`, and with `glm(testData ~ 1, family="quasibinomial")` I get yet another value. Admittedly, my test data can have a very broad and skewed distribution. So I guess it boils down to the question of how I want to define the central tendency of such distributions... – amoeba Sep 07 '16 at 11:45
  • I would (1) run all three fits with glmmTMB; (2) look at all aspects of the fit - predicted values, residuals, etc. - to understand what's going on – Ben Bolker Sep 07 '16 at 12:21
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/45054/discussion-between-ben-bolker-and-amoeba). – Ben Bolker Sep 07 '16 at 15:24