
I have a repeated-measures experiment where the dependent variable is a percentage, and I have multiple factors as independent variables. I'd like to use glmer from the R package lme4 to treat it as a logistic regression problem (by specifying family=binomial) since it seems to accommodate this setup directly.

My data looks like this:

 > head(data.xvsy)
   foldnum      featureset noisered pooldur dpoolmode       auc
 1       0         mfcc-ms      nr0       1      mean 0.6760438
 2       1         mfcc-ms      nr0       1      mean 0.6739482
 3       0    melspec-maxp    nr075       1       max 0.8141421
 4       1    melspec-maxp    nr075       1       max 0.7822994
 5       0 chrmpeak-tpor1d    nr075       1       max 0.6547476
 6       1 chrmpeak-tpor1d    nr075       1       max 0.6699825

and here's the R command that I was hoping would be appropriate:

 glmer(auc~1+featureset*noisered*pooldur*dpoolmode+(1|foldnum), data.xvsy, family=binomial)

The problem with this is that the command complains about my dependent variable not being integers:

In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!

and the analysis of this (pilot) data gives weird answers as a result.
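
The warning is not specific to glmer. As a minimal sketch, assuming only base R and reusing the first four auc values from the table above, an unweighted binomial glm on raw proportions raises the same message, because the family treats each response as a number of successes out of a single trial:

```r
# First four auc values from head(data.xvsy) above
auc <- c(0.6760438, 0.6739482, 0.8141421, 0.7822994)

# Capture the warning text instead of letting it print
msg <- tryCatch(
  {
    glm(auc ~ 1, family = binomial)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)

print(msg)
```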

I understand why the binomial family expects integers (yes-no counts), but it seems it should be OK to regress percentage data directly. How can I do this?

amoeba
Dan Stowell
  • It doesn't seem OK to me, as 5 out of 10 isn't the same information as 500 out of 1000. Express the response as one count of the no. "successes" & one count of the no. "failures". – Scortchi - Reinstate Monica Feb 26 '14 at 14:03
  • @Scortchi thanks, I think you may be right. I was thinking in part about the continuous nature of my percentages (derived from probabilistic decisions) similar to this question: http://stats.stackexchange.com/questions/77376/generalized-linear-models-with-continuous-proportions but I believe I can express my data via a meaningful conversion to integer counts. – Dan Stowell Feb 26 '14 at 14:46
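
Scortchi's point can be checked numerically. The following is a rough sketch in base R: the counts 5/10 and 500/1000 come from the comment, while the helper `fit_prop` and the object names are made up for illustration. Both fits estimate the same proportion, but the standard error of the intercept (on the logit scale) shrinks as the number of trials grows.

```r
# Hypothetical helper: intercept-only binomial fit from success/failure counts
fit_prop <- function(succ, fail) {
  d <- data.frame(succ = succ, fail = fail)
  glm(cbind(succ, fail) ~ 1, family = binomial, data = d)
}

fit_small <- fit_prop(5, 5)      # 5 successes out of 10
fit_large <- fit_prop(500, 500)  # 500 successes out of 1000

# Estimated proportion, back-transformed from the logit scale
p_small <- unname(plogis(coef(fit_small)))
p_large <- unname(plogis(coef(fit_large)))

# Standard error of the intercept on the logit scale
se_small <- unname(sqrt(diag(vcov(fit_small))))
se_large <- unname(sqrt(diag(vcov(fit_large))))

print(c(p_small = p_small, p_large = p_large))
print(c(se_small = se_small, se_large = se_large))
```

The proportion alone therefore does not carry enough information for a binomial fit; the number of trials has to be supplied as well.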

2 Answers

24

In order to use a vector of proportions as the response variable with glmer(., family = binomial), you need to set the number of trials that led to each proportion using the weights argument. For example, using the cbpp data from the lme4 package:

glmer(incidence / size ~ period + (1 | herd), weights = size,
   family = binomial, data = cbpp)

If you do not know the total number of trials, then a binomial model is not appropriate, as the warning message indicates.
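
As a sketch of how the two ways of specifying a binomial response relate, the proportion-plus-weights form above can be compared with a two-column response of successes and failures, again using `cbpp`. The object names `m_prop` and `m_cnt` are made up here, and the lme4 package is assumed to be installed.

```r
library(lme4)

# Form 1: proportion as response, number of trials passed as weights
m_prop <- glmer(incidence / size ~ period + (1 | herd), weights = size,
                family = binomial, data = cbpp)

# Form 2: two-column response, cbind(successes, failures)
m_cnt <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
               family = binomial, data = cbpp)

# Compare the fixed-effect estimates of the two fits
print(cbind(proportion = fixef(m_prop), counts = fixef(m_cnt)))
same_fixef <- isTRUE(all.equal(fixef(m_prop), fixef(m_cnt), tolerance = 1e-4))
print(same_fixef)
```

The two forms should agree because both give the model the same information: how many successes were observed and out of how many trials.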

amoeba
Steve Walker
  • I can't say whether using weights for this works or not. But you certainly can input the data as a two column matrix (successes/failures) on the left hand side of the formula. – ndoogan Feb 26 '14 at 22:15
  • But @ndoogan, the original question was about proportions, not successes/failures. And the above code does work, as I took it from the `cbpp` help page. – Steve Walker Feb 26 '14 at 23:37
  • Fair enough. Though, I intended to mean successes/failures (*not* intended to be division) is where the proportions for a binomial model come from. – ndoogan Feb 27 '14 at 02:45
  • +1 but readers might want to see @BenBolker's answer here http://stats.stackexchange.com/questions/189115 about possible ways to deal with overdispersion. – amoeba Sep 14 '16 at 14:12
9

If your response is a proportion, a percentage, or anything similar that can only take values in $(0,1)$, you would typically use beta regression rather than binomial regression.
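
For a repeated-measures design like the one in the question, one possible route is a beta mixed model. The sketch below is an assumption-laden illustration rather than a recommendation from this answer: it uses the glmmTMB package (not mentioned in the thread) with its `beta_family()`, and the data, the grouping factor `g`, and the predictor `x` are simulated stand-ins, not the asker's variables.

```r
library(glmmTMB)

set.seed(1)

# Simulated stand-in data: 10 groups, 20 observations each
n_groups <- 10
n_per <- 20
d <- data.frame(
  g = factor(rep(seq_len(n_groups), each = n_per)),
  x = rnorm(n_groups * n_per)
)

# Group-level random intercepts and a logit-linear mean
u <- rnorm(n_groups, sd = 0.3)
mu <- plogis(0.5 + 0.8 * d$x + u[as.integer(d$g)])

# Beta-distributed response with mean mu and precision phi
phi <- 20
d$y <- rbeta(nrow(d), shape1 = mu * phi, shape2 = (1 - mu) * phi)

# Beta regression with a random intercept per group
fit <- glmmTMB(y ~ x + (1 | g), data = d, family = beta_family(link = "logit"))

# Fixed effects of the conditional model, on the logit scale
print(fixef(fit)$cond)
```

Note that the beta distribution is defined on the open interval, so responses of exactly 0 or 1 would need separate handling before a model like this can be fitted.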

amoeba
M. Berk
  • A binomial model is a model of proportions. Though, it's only appropriate when you know the number of trials. If all you have is a percent with no indication of the number of trials, then I believe you are correct that beta regression is appropriate. – ndoogan Feb 27 '14 at 02:47
  • @ndoogan To clarify, my advice is not "use beta regression when your response is a proportion" but rather "if your response can only take values in $(0,1)$ _such as_ proportions/percentages then beta regression is typical" – M. Berk Feb 27 '14 at 10:51
  • Thanks, this is a good point. I'm accepting the other answer because it answers the question as written, but the point about beta regression is well made so I've upvoted it. – Dan Stowell Feb 27 '14 at 12:37