I'm not a statistician, but I need to use these clever tools to analyse some data I have.
I have a really simple dataset to analyse (see below; cases = disease counts, pop = total number of subjects sampled per year), but I can't seem to settle on the most appropriate model to use.
It is a count of a disease across different years. I am interested in studying the change in disease counts over the study period (2000-2008) and in seeing whether this change is 'significant'.
I am under the impression that Poisson regression would be appropriate for modelling how the disease count changes over time. The disease count is a prevalence, so it includes both old and new cases in a given year. This means my data violate the assumption that the counts are independent of each other: the counts in 2001, for example, may have come partly from the same individuals sampled in 2000.
'data.frame': 9 obs. of 3 variables:
$ year : int 2000 2001 2002 2003 2004 2005 2006 2007 2008
$ cases : int 76 103 110 129 129 135 144 130 147
$ pop : int 3766 4012 3993 4111 4086 4100 4060 4132 4084
I have tried performing Poisson regression with glmer(), specifying 'year' as a random effect to account for clustering of cases between years. The call looked roughly like this (a sketch; 'mixed.model' and 'my.data' are placeholder names for the objects I used, with the data frame being the one shown above):
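library(lme4)

# Poisson GLMM: linear trend in year plus a random intercept per year
# (one observation per year, so the random intercept acts like an
# observation-level effect); log(pop) as offset puts the model on the
# rate scale
mixed.model <- glmer(cases ~ year + (1 | year) + offset(log(pop)),
                     family = poisson, data = my.data)

but glmer() gives me this message: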
boundary (singular) fit: see ?isSingular
Maybe it's because my dataset is too small? My question is quite simply this: how do I use Poisson regression to account for repeated measures in this case?
If I were to assume independence of the observations/counts, Poisson regression with glm() gives me interpretable results showing that the increase in counts over time is statistically significant (the fit is shown below). I'm not sure I can trust these results if I haven't accounted for the repeated measures...
Any help, comments are appreciated!
I have added a bit more detail - please check whether I am doing this correctly for the type of data I have. Here it is:
> new.prevalence.data
  year cases  pop   logpop
1 2000    60 3700 8.216088
2 2001    70 4000 8.294050
3 2002   100 3990 8.291547
4 2003   130 4100 8.318742
5 2004   140 4086 8.315322
6 2005   140 4100 8.318742
7 2006   167 4060 8.308938
8 2007   170 4132 8.326517
9 2008   175 4084 8.314832
The distribution of counts looks like this: [plot omitted]
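For reference, logpop is just the natural log of the population, added so it can be used as an offset (the offset turns a model for counts into a model for the rate cases/pop):

# log of the population column, used as the offset below
new.prevalence.data$logpop <- log(new.prevalence.data$pop)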
poisson.model.rate <- glm(cases ~ year + offset(logpop), family = poisson, data = new.prevalence.data)
> summary(poisson.model.rate)
Call:
glm(formula = cases ~ year + offset(logpop), family = poisson,
data = new.prevalence.data)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.81701 -1.37797  0.08085  1.02010  1.76332

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -229.96091   23.70549  -9.701   <2e-16 ***
year           0.11301    0.01182   9.557   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 107.55  on 8  degrees of freedom
Residual deviance:  13.72  on 7  degrees of freedom
AIC: 77.391

Number of Fisher Scoring iterations: 4
My dispersion parameter here is about 1.96 (residual deviance / residual df = 13.72 / 7), which suggests overdispersion. Therefore I've proceeded to use the quasi-Poisson model:
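This is how I checked the dispersion and fitted the quasi-Poisson model ('quasipoisson.model.rate' is just the name I gave the object):

# deviance-based estimate of the dispersion
deviance(poisson.model.rate) / df.residual(poisson.model.rate)    # 13.72 / 7 = 1.96
# Pearson-based estimate, which is what the quasi-Poisson summary reports (1.946079)
sum(residuals(poisson.model.rate, type = "pearson")^2) / df.residual(poisson.model.rate)

quasipoisson.model.rate <- glm(cases ~ year + offset(logpop),
                               family = quasipoisson, data = new.prevalence.data)
summary(quasipoisson.model.rate)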
Call:
glm(formula = cases ~ year + offset(logpop), family = quasipoisson,
data = new.prevalence.data)
Deviance Residuals:
     Min       1Q   Median       3Q      Max
-1.81701 -1.37797  0.08085  1.02010  1.76332

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -229.9609    33.0696  -6.954 0.000220 ***
year           0.1130     0.0165   6.851 0.000242 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 1.946079)

    Null deviance: 107.55  on 8  degrees of freedom
Residual deviance:  13.72  on 7  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4
Does anyone see a problem with using this model to conclude that prevalence increases with time and that this increase is significant? That is, a one-unit increase in year multiplies the prevalence rate by a factor of exp(0.113) ≈ 1.12, and this increase is statistically significant (p < 0.05)?
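In case it matters, this is how I would report the effect size (a normal-approximation Wald interval on the quasi-Poisson fit, using the 'quasipoisson.model.rate' object from above):

# rate ratio per one-year increase, with a 95% Wald confidence interval
exp(coef(quasipoisson.model.rate)["year"])              # exp(0.113) ≈ 1.12
exp(confint.default(quasipoisson.model.rate, "year"))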
Thank you