
The offset argument in glm() is confusing me. Below, m3 uses offset the way I have seen it used, and m4 is a manually computed analog that standardizes the response instead. The results are completely different, with m3 performing better. Why do they differ, and which one is correct?

standard = (ratepy / pop) * 100000

m3 = glm(as.integer(ratepy) ~ rainpd + temppd + distLon + offset(I(log(pop / 1e5))), family = poisson, data = suit)
m4 = glm(as.integer(standard) ~ rainpd + temppd + distLon, family = poisson, data = suit)
lilkaskitc

1 Answer


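The point of the offset is that you do not explicitly transform the response. The rate resulting from the standardization is typically not an integer, so a count distribution such as the Poisson is not a natural fit for it. In m4, as.integer() additionally truncates the standardized rate, which discards information.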

Instead one keeps the count response for which a count distribution like Poisson is appropriate and includes log(exposure) as an offset. Then you get

$$ \log(E(response)) = x^\top \beta + \log(exposure) $$

which corresponds to

$$ \log(E(response/exposure)) = x^\top \beta $$
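
The step between the two equations can be spelled out as follows, assuming the exposure is a known constant for each observation so that it can be moved inside the expectation:

$$ E(response) = exposure \cdot \exp(x^\top \beta) \quad \Longrightarrow \quad E(response/exposure) = \exp(x^\top \beta) $$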

In short: Use the approach from m3.
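
As a rough illustration, here is a minimal simulation sketch in R. It is not the original poster's data: the variable names (`cases`, `pop`, `x`), the sample size, and the true coefficients are all assumptions chosen for the example. It fits the offset model in the style of m3, the truncated standardized-rate model in the style of m4, and, as an extra that is not part of the answer above, a Poisson fit to the untruncated rate with the exposure as weights, which should reproduce the offset estimates.

```r
# Simulated data; every name and value here is an illustrative assumption.
set.seed(1)
n   <- 500
pop <- round(runif(n, 2e4, 5e5))   # exposure (population size)
x   <- rnorm(n)                    # a single covariate

# True model: the rate per 100,000 is exp(1 + 0.5 * x)
cases <- rpois(n, (pop / 1e5) * exp(1 + 0.5 * x))

# m3-style: keep the count response, add log(exposure) as an offset
fit_offset <- glm(cases ~ x + offset(log(pop / 1e5)), family = poisson)

# m4-style: standardize the response, truncate it to an integer, no offset
std <- cases / pop * 1e5
fit_trunc <- glm(as.integer(std) ~ x, family = poisson)

# Untruncated rate with exposure as weights (non-integer response warnings
# from the Poisson family are expected here and are suppressed)
fit_rate_w <- suppressWarnings(
  glm(std ~ x, family = poisson, weights = pop / 1e5)
)

# Compare the estimated coefficients of the three fits
print(rbind(offset        = coef(fit_offset),
            truncated     = coef(fit_trunc),
            weighted_rate = coef(fit_rate_w)))
```

In this sketch the offset fit should land close to the true values of 1 and 0.5, and the weighted rate fit should agree with it up to numerical tolerance because the two models have the same score equations. The truncated, unweighted fit is a different model and will generally give different estimates.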

Achim Zeileis
  • Why won't the Poisson regression model fit well if the y values are not integers, i.e. if we just model the rate? I realize that the loss function includes a y! term, but this is not used in the optimization to fit the Poisson regression model since log(y!) drops out after differentiating. – Cokes Mar 27 '17 at 17:26
  • I was thinking that the loss function would be the same, but you will get different estimates of the parameters if you use the rate instead of an exposure. See [here](http://stats.stackexchange.com/questions/264071/how-is-a-poisson-rate-regression-equal-to-a-poisson-regression-with-correspondin/270151#270151) – Cokes Mar 28 '17 at 00:05