How to treat proportions with different sample sizes as response variable in GLM / GLMM?

Question

For an ecological research project, I am trying to model the effect of different factors on the prevalence of a specific pathogen in ticks.

Ticks were collected from around 80 different plots and screened for pathogens. Prevalence is then the proportion of ticks tested positive per site.

Ideally, I would like to use prevalence as the response variable in a GLM or GLMM. It follows a negative binomial distribution. I would then treat this proportion as a count variable (no. of positives out of 100 samples).

However, the number of ticks collected varies across sites. (From 1 to 110)

The prevalence calculated for plots with smaller sample sizes is obviously less reliable compared to plots with larger sample sizes. Is there a way to include a measure of this uncertainty in the analysis?

For now, I excluded all plots with sample sizes < 50, but this doesn't seem like an elegant solution.

Should I try a different modeling approach? I feel like there must be a better way to do this.

All help is greatly appreciated!

score 0 · Answer 1 · answered Nov 10 '21 at 15:19

A mixed model sounds like a good way to do this. The sites with fewer observations will be pooled towards the population level estimate and sites with more observations will depart from that estimate. Here is an example in R

library(tidyverse)
library(lme4)

#Simulate data

nsites <- 80
sites <- 1:nsites
ticks_sampled <- sample(1:110, replace=T, size=nsites)

# Assume the population prevalence is 20%
# Further assume there is heterogeneity in the prevalence
# between sites.
Z = model.matrix(~factor(sites))
g = rnorm(ncol(Z), 0, 0.25)
logit_p = qlogis(0.2) + Z%*%g
p = plogis(logit_p)
y = rbinom(nsites, ticks_sampled, p)

# Fit the model
d = tibble(sites, ticks_sampled, y)

model = glmer(y~1 + (1|sites) + offset(log(ticks_sampled)), data = d, family = poisson())

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: poisson  ( log )
Formula: y ~ 1 + (1 | sites) + offset(log(ticks_sampled))
   Data: d

     AIC      BIC   logLik deviance df.resid 
   428.3    433.0   -212.1    424.3       78 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.0381 -0.5826 -0.1137  0.5330  2.4711 

Random effects:
 Groups Name        Variance Std.Dev.
 sites  (Intercept) 0.039    0.1975  
Number of obs: 80, groups:  sites, 80

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.64026    0.04257  -38.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note our estimate on the log scale is very close to the true value of 20 per 100 samples. Additionally, a scatter plot of the random effects versus sample size demonstrates this partial pooling effect

How to treat proportions with different sample sizes as response variable in GLM / GLMM?

1 Answers1