Poisson regression can be fitted to either grouped or ungrouped data, and I expected the two approaches to differ in some way. To check, I compared them on a set of simulated data. I found that the estimated parameters are the same for both, but the residual deviances are very different.

This brings me to my question: is there any assumption that needs to be satisfied before we can group the data?

```r
# R code for simulated data
rm(list = ls())
set.seed(1)

##############################################################
# Creating random age, gender, observed count and population #
##############################################################
nsim <- 10000
age <- sample(20:70, size = nsim, replace = TRUE)
Gender <- sample(c("M", "F"), size = nsim, replace = TRUE)
obs.count <- sample(c(0, 0, 1), size = nsim, replace = TRUE)
population <- sample(c(0.7, 0.8, 0.9, 1), size = nsim, replace = TRUE)
ungrouped.data <- data.frame(age, Gender, obs.count, population)

# Sum the counts and the exposure within each age-by-gender cell
grouped.data <- aggregate(cbind(obs.count, population) ~ age + Gender,
                          data = ungrouped.data, FUN = sum)

###############################################
# GLM model for grouped and ungrouped data    #
###############################################
model.group <- glm(obs.count ~ age + Gender + offset(log(population)),
                   family = poisson, data = grouped.data)
summary(model.group)
model.ungroup <- glm(obs.count ~ age + Gender + offset(log(population)),
                     family = poisson, data = ungrouped.data)
summary(model.ungroup)
```
kjetil b halvorsen
Fesco
  • What do you mean by grouped? Sum of counts and observation times with all predictors shared at the group level? If so, then there is no difference and any differing analysis results are due to software-usage errors. – Björn May 02 '17 at 19:33
  • @Björn. Thanks for your reply. Yes, you are right in the description of what I meant by grouping. However, if there is indeed no difference in the analysis, why is there a different result for the residual deviance? – Fesco May 03 '17 at 17:40
  • I have also added the R code that I used to simulate the data. If you run it, you will see a very different result in the residual deviance. – Fesco May 03 '17 at 17:53
  • The deviance difference is due to the fact that one set has 102 samples while the other has 10000 (see `deviance(model.ungroup)` is roughly 100 times bigger than `deviance(model.group)`). – Firebug May 03 '17 at 17:56
  • There are some close votes, presumably because this is seen as a software question. But the problem here is conceptual and important in practice, so I think this should be left open. – kjetil b halvorsen May 04 '17 at 08:51

2 Answers

Since the sums of counts by combination of factors in the model together with the anti-logged offsets are the sufficient statistics for a Poisson distribution, there should be no difference between the two analyses. Any differing analysis results are due to software-usage errors.
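
The sufficiency argument can be made concrete by writing out both log-likelihoods. The notation below is shorthand introduced here, not taken from the question: $y_{gi}$ and $t_{gi}$ are the count and exposure (the `population` column) of individual $i$ in covariate cell $g$, $x_g$ is the shared covariate vector of that cell, and $Y_g = \sum_i y_{gi}$, $T_g = \sum_i t_{gi}$ are the cell totals. This is a sketch under the assumption that every predictor is constant within a cell.

```latex
\begin{align*}
\ell_{\text{ungrouped}}(\beta)
  &= \sum_g \sum_{i \in g}
     \Bigl[\, y_{gi}\bigl(\log t_{gi} + x_g^\top\beta\bigr)
            - t_{gi}\, e^{x_g^\top\beta} - \log y_{gi}! \,\Bigr] \\
  &= \sum_g \Bigl[\, Y_g\, x_g^\top\beta - T_g\, e^{x_g^\top\beta} \,\Bigr]
     + C_{\text{ungrouped}}, \\[6pt]
\ell_{\text{grouped}}(\beta)
  &= \sum_g
     \Bigl[\, Y_g\bigl(\log T_g + x_g^\top\beta\bigr)
            - T_g\, e^{x_g^\top\beta} - \log Y_g! \,\Bigr] \\
  &= \sum_g \Bigl[\, Y_g\, x_g^\top\beta - T_g\, e^{x_g^\top\beta} \,\Bigr]
     + C_{\text{grouped}}.
\end{align*}
```

The constants $C_{\text{ungrouped}}$ and $C_{\text{grouped}}$ do not involve $\beta$, so both log-likelihoods are maximised at the same $\beta$ and depend on the data only through $(Y_g, T_g)$.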

In this case, the problem is that the R glm function does not know what degrees of freedom to use. This can be a problem with some software, when you use sufficient statistics instead of individual observations. For example, PROC NLMIXED in SAS has the DF option in the PROC NLMIXED statement to deal with this type of problem. I am not sure what the equivalent option in glm is, but I assume it exists.
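
To see which quantities agree and which do not, the question's simulation can be refitted and compared directly. The following is a minimal sketch that rebuilds the same kind of data so it runs on its own; the object names `form`, `drop.group` and `drop.ungroup` are introduced here for illustration.

```r
# Rebuild data of the same form as in the question
set.seed(1)
nsim <- 10000
ungrouped.data <- data.frame(
  age        = sample(20:70, size = nsim, replace = TRUE),
  Gender     = sample(c("M", "F"), size = nsim, replace = TRUE),
  obs.count  = sample(c(0, 0, 1), size = nsim, replace = TRUE),
  population = sample(c(0.7, 0.8, 0.9, 1), size = nsim, replace = TRUE)
)

# Sum counts and exposure within each age-by-gender cell
grouped.data <- aggregate(cbind(obs.count, population) ~ age + Gender,
                          data = ungrouped.data, FUN = sum)

form <- obs.count ~ age + Gender + offset(log(population))
model.group   <- glm(form, family = poisson, data = grouped.data)
model.ungroup <- glm(form, family = poisson, data = ungrouped.data)

# Coefficient estimates: compare up to numerical tolerance
all.equal(coef(model.group), coef(model.ungroup), tolerance = 1e-6)

# Residual deviance and residual degrees of freedom for each fit
c(grouped = deviance(model.group), ungrouped = deviance(model.ungroup))
c(grouped = df.residual(model.group), ungrouped = df.residual(model.ungroup))

# Drop in deviance from the null (intercept plus offset) model for each fit
drop.group   <- model.group$null.deviance   - deviance(model.group)
drop.ungroup <- model.ungroup$null.deviance - deviance(model.ungroup)
all.equal(drop.group, drop.ungroup, tolerance = 1e-6)
```

The residual deviance is measured against a saturated model with one parameter per row, and the number of rows differs between the two data sets, so the two deviances are not comparable with each other. The drop in deviance between nested models is a difference of log-likelihoods in which the data-dependent constant cancels, which is why it can be compared across the two fits.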

Björn

My expectation would be that the aggregated model will always have a lower deviance as it does not have to explain the variance within the groups. In the 1-dimensional linear case this is easier to see, as grouped data will usually have higher correlation. See: "Better fit" using aggregated data in comparison to disaggregated data: explanation?
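
The linear case mentioned above can be illustrated with a small simulation. This is a sketch with made-up data (20 values of `x`, 50 noisy observations at each), not the question's data; all names are hypothetical.

```r
# Hypothetical data: 20 distinct x values, 50 noisy observations per value
set.seed(42)
raw <- data.frame(x = rep(1:20, each = 50))
raw$y <- 2 + 0.5 * raw$x + rnorm(nrow(raw), sd = 5)

# Aggregate to one row per x value by averaging y
grouped <- aggregate(y ~ x, data = raw, FUN = mean)

fit.raw     <- lm(y ~ x, data = raw)
fit.grouped <- lm(y ~ x, data = grouped)

# Same fitted line (the groups are balanced), very different R-squared
rbind(raw = coef(fit.raw), grouped = coef(fit.grouped))
r2.raw     <- summary(fit.raw)$r.squared
r2.grouped <- summary(fit.grouped)$r.squared
c(raw = r2.raw, grouped = r2.grouped)
```

Averaging removes most of the within-group noise, so the grouped fit looks better by R-squared even though the estimated line is the same.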

Silverfish
AndreasM