The average of two counts as a non-integral dependent variable in negative binomial models?

Question

I am analysing the effect of a protein level on a joint score, adjusting for confounding variables (age and gender). To obtain the joint score, each joint in the hand is evaluated and all values are summed. This scoring is performed by two independent physicians and the two scores are averaged. There are no missing values. The scores appear to follow a negative binomial distribution, so I fitted this model:

# glm.nb() is provided by MASS (lme4 provides glmer.nb() for mixed models)
mod <- MASS::glm.nb(score ~ age + gender + protein_level,
                    data = d01)

Data frame with averaged scores.

d01 <- data.frame(subject = 1:10,
                  gender = c("female", "female", "male", "female", "female",
                             "female", "male", "female", "female", "male"),
                  age = c(74, 78, 62, 62, 77, 66, 66, 52, 60, 60),
                  protein_level = c(1.23, 1.07, 12.79, 1.75, 11.63,
                                    10.13, 0.89, 7.18, 1.23, 0.29),
                  score = c(18, 9, 30, 24.5, 41, 54.5, 2.5, 11.5, 21.5, 5.5))

However, the averaged values are non-integral and I also receive this warning.

Warning messages: 1: In dpois(y, mu, log = TRUE) : non-integer x = 0.500000, etc.

I suppose that the possibilities below could work:

1. Ignore this warning, the model works.
2. Instead of the average, use the sum of scores. However, I need the average score for a publication. Could I divide model coefficients (the estimated mean and confidence intervals) by two (two physicians)?
3. I found that it is possible to use offset(log()), but I do not know how to apply it to my model.

Could you please recommend the best option, or suggest a better one?

Thank you in advance for any answers.

I also attach the data frame with scores by both physicians.

d02 <- data.frame(subject = rep(1:10, 2),
                  eval = rep(c("eval1", "eval2"), each = 10),
                  gender = rep(c("female", "female", "male", "female", "female",
                                 "female", "male", "female", "female", "male"), 2),
                  age = rep(c(74, 78, 62, 62, 77, 66, 66, 52, 60, 60), 2),
                  protein_level = rep(c(1.23, 1.07, 12.79, 1.75, 11.63,
                                        10.13, 0.89, 7.18, 1.23, 0.29), 2),
                  score = c(17, 8, 30, 24, 40, 43, 2, 11, 21, 6,
                            19, 10, 30, 25, 42, 66, 3, 12, 22, 5))
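
A note on the offset(log()) idea and on dividing coefficients by two. With a log link, the summed score has twice the expected value of the averaged score, so log E[sum] = log(2) + log E[average]. The slopes are therefore the same on both scales and only the intercept shifts by log(2); the coefficients themselves should not be divided by two. The following is a minimal sketch under stated assumptions, not a definitive analysis: `d_sum`, `score_sum` and `n_raters` are names made up for illustration, and the sums are the eval1 + eval2 scores of the ten example subjects.

```r
library(MASS)  # provides glm.nb()

# Summed scores of both physicians (eval1 + eval2) for the ten example subjects.
d_sum <- data.frame(
  gender = c("female", "female", "male", "female", "female",
             "female", "male", "female", "female", "male"),
  age = c(74, 78, 62, 62, 77, 66, 66, 52, 60, 60),
  protein_level = c(1.23, 1.07, 12.79, 1.75, 11.63,
                    10.13, 0.89, 7.18, 1.23, 0.29),
  score_sum = c(36, 18, 60, 49, 82, 109, 5, 23, 43, 11),
  n_raters = 2  # hypothetical column: number of physicians per subject
)

# Integer response, so the non-integer warning does not arise.
mod_sum <- glm.nb(score_sum ~ age + gender + protein_level, data = d_sum)

# Same model with an offset: the intercept is now on the per-physician scale.
mod_off <- glm.nb(score_sum ~ age + gender + protein_level +
                    offset(log(n_raters)), data = d_sum)

# Slopes agree; only the intercepts differ (by log(2)).
cbind(sum_scale = coef(mod_sum), average_scale = coef(mod_off))
```

Predictions from `mod_off` at `n_raters = 1` are then on the scale of a single physician's score, which matches the scale of the averaged score.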

Could you elaborate on why you need to use a negative binomial model? It really doesn't seem like the scores do follow a negative binomial distribution. — awhug, Jun 30 '21 at 11:05
awhug, thank you for your response. d01 and d02 are only small sets of my dataset, which contains over 150 patients. The joint score values are always positive whole numbers (for one evaluator) and their histogram is right-skewed. Additionally, car::qqp() enables identification of the distribution and it suggests negative binomial or gamma distributions. Since there are few zeros (=healthy controls), I suppose the negative binomial model is the best. — Osgarion, Jun 30 '21 at 13:24
Great! There's been a number of threads on using non-integer values in count GLMs (see [here](https://stats.stackexchange.com/questions/223160), [here](https://stats.stackexchange.com/questions/38530), and [here](https://stats.stackexchange.com/questions/340583)). If you want to model the average physician scores, a gamma GLM sounds like it *might* be more appropriate to me. But on the distribution - note the [distribution of the model residuals may be more important than that of the outcome alone](https://stats.stackexchange.com/questions/515444). — awhug, Jul 01 '21 at 05:32
awhug, thank you again. I will try the gamma GLM. I found that I could add one point to all values to avoid any zero. You are right; my final decision would be on residual distribution and the homogeneity of variance. Thank you. — Osgarion, Jul 01 '21 at 08:18
Thank you kjetil, you are right. I apologize for my mistake. — Osgarion, Jul 09 '21 at 09:12

score 1 · Answer 1 · answered Jul 05 '21 at 21:50

You could fit the model to the mean of the counts using quasi-likelihood. This is often done for Poisson regression in R, using the quasipoisson family. There is no such family for negative binomial regression, but there is a general quasi family you can use: just give it the link and variance function used in negative binomial regression.
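
As a rough illustration of the quasi-likelihood route, here is a minimal sketch using the averaged scores from the question. It shows the quasi-Poisson case (log link, variance proportional to the mean) and the same fit written with the general `quasi` family. It does not show the negative binomial variance function mu + mu^2/theta: that is not one of the built-in character options of `quasi`, and according to `?family` it would have to be supplied as a list with components `varfun`, `validmu`, `dev.resids`, `initialize` and `name`.

```r
# Averaged scores from the question; non-integer values are accepted here.
d01 <- data.frame(
  gender = c("female", "female", "male", "female", "female",
             "female", "male", "female", "female", "male"),
  age = c(74, 78, 62, 62, 77, 66, 66, 52, 60, 60),
  protein_level = c(1.23, 1.07, 12.79, 1.75, 11.63,
                    10.13, 0.89, 7.18, 1.23, 0.29),
  score = c(18, 9, 30, 24.5, 41, 54.5, 2.5, 11.5, 21.5, 5.5)
)

# Quasi-Poisson: log link, Var(y) = phi * mu, with phi estimated from the data.
mod_qp <- glm(score ~ age + gender + protein_level,
              family = quasipoisson(link = "log"), data = d01)

# The same mean and variance structure via the general quasi family.
mod_q <- glm(score ~ age + gender + protein_level,
             family = quasi(link = "log", variance = "mu"), data = d01)

summary(mod_qp)$dispersion  # estimated dispersion phi
cbind(quasipoisson = coef(mod_qp), quasi = coef(mod_q))
```

Because no likelihood is evaluated, neither fit calls `dpois()`, so the non-integer warning does not appear. The price is that AIC is not available for these models.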

But why not use the two counts obtained from the two separate evaluations? Then you can model eval (from your second data frame, d02) as a random effect. Trying this with

mod <- lme4::glmer.nb(score ~ age + gender + protein_level + (1|eval), 
                    data = d02)

gives a warning about a nearly unidentifiable model, but I understand you have only given a subset of your data. The estimated model is

 summary(mod)
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: Negative Binomial(7.495)  ( log )
Formula: score ~ age + gender + protein_level + (1 | eval)
   Data: d02

     AIC      BIC   logLik deviance df.resid 
   148.4    154.4    -68.2    136.4       14 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.63510 -0.71656  0.08666  0.72334  1.28905 

Random effects:
 Groups Name        Variance  Std.Dev. 
 eval   (Intercept) 1.075e-12 1.037e-06
Number of obs: 20, groups:  eval, 2

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    2.835954   0.962648   2.946  0.00322 ** 
age           -0.003783   0.014313  -0.264  0.79151    
gendermale    -0.966123   0.246037  -3.927 8.61e-05 ***
protein_level  0.117626   0.020316   5.790 7.05e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr) age    gndrml
age         -0.987              
gendermale  -0.334  0.303       
protein_lvl -0.002 -0.104 -0.220
optimizer (Nelder_Mead) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular

Interestingly enough, the random-effect variance is essentially zero, so maybe quasi-likelihood is a better option?
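
The singular fit can also be checked directly rather than read off the summary. A minimal sketch, assuming lme4 is installed and using the d02 data from the question; the fit is expected to emit the same warnings as above.

```r
library(lme4)  # provides glmer.nb(), isSingular(), VarCorr()

# Per-physician scores from the question (data frame d02).
d02 <- data.frame(
  subject = rep(1:10, 2),
  eval = rep(c("eval1", "eval2"), each = 10),
  gender = rep(c("female", "female", "male", "female", "female",
                 "female", "male", "female", "female", "male"), 2),
  age = rep(c(74, 78, 62, 62, 77, 66, 66, 52, 60, 60), 2),
  protein_level = rep(c(1.23, 1.07, 12.79, 1.75, 11.63,
                        10.13, 0.89, 7.18, 1.23, 0.29), 2),
  score = c(17, 8, 30, 24, 40, 43, 2, 11, 21, 6,
            19, 10, 30, 25, 42, 66, 3, 12, 22, 5)
)

mod <- glmer.nb(score ~ age + gender + protein_level + (1 | eval),
                data = d02)

isSingular(mod)  # TRUE/FALSE: is the fit on the boundary of the parameter space?
VarCorr(mod)     # estimated standard deviation of the eval intercepts
```

With only two levels of eval, a variance component is hard to estimate in any case, so a boundary estimate here is not surprising.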

Thank you, kjetil, for your recommendations. I had not considered that approach. However, that was only my basic model, which I have since adjusted for another confounding factor (bmi) and for additional measurements of each patient at the next two examinations. Thus, my latest model looks like this: `lme4::glmer.nb(score ~ offset(log(age)) + gender + offset(log(bmi)) + examination + protein_level + (1|subject) + (1|eval), data = d03)`. Unfortunately, DHARMa diagnostics revealed deviations in this model. Should I consider another approach, e.g. generalized additive models? Thank you very much for your assistance. — Osgarion, Jul 09 '21 at 09:46
You could certainly try with gam's, R's mgcv package supports negative binomial models, and also random effects: https://rdrr.io/cran/mgcv/man/negbin.html — kjetil b halvorsen, Jul 13 '21 at 02:01
@Osgarion, I agree w/ Kjetil. I would just use the raw scores and control for rater. With just two raters, it may work better to model them as a fixed effect, IDK. — gung - Reinstate Monica, Jul 14 '21 at 20:19
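
For the suggestion of controlling for rater as a fixed effect, a minimal sketch on the d02 example data might look as follows. The name `mod_fixed` is made up for illustration, and this simple version treats the two scores of each subject as independent, which a full analysis would probably want to address (for example with a subject-level random effect).

```r
library(MASS)  # provides glm.nb()

# Per-physician scores from the question (data frame d02).
d02 <- data.frame(
  subject = rep(1:10, 2),
  eval = rep(c("eval1", "eval2"), each = 10),
  gender = rep(c("female", "female", "male", "female", "female",
                 "female", "male", "female", "female", "male"), 2),
  age = rep(c(74, 78, 62, 62, 77, 66, 66, 52, 60, 60), 2),
  protein_level = rep(c(1.23, 1.07, 12.79, 1.75, 11.63,
                        10.13, 0.89, 7.18, 1.23, 0.29), 2),
  score = c(17, 8, 30, 24, 40, 43, 2, 11, 21, 6,
            19, 10, 30, 25, 42, 66, 3, 12, 22, 5)
)

# Raw integer scores as the response, rater (eval) as a fixed effect.
mod_fixed <- glm.nb(score ~ age + gender + protein_level + eval,
                    data = d02)

# The evaleval2 row is the log rate ratio of physician 2 versus physician 1.
summary(mod_fixed)$coefficients
```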
