8

My question is related, but not the same as the following question: Fitting a Poisson GLM in R - issues with rates vs. counts

Here's some fake data:

### some fake data
x=c(1:14)
y=c(0,  1,  2,  3,  1,  4,  9, 18, 23, 31, 20, 25, 37, 45)
y_rate <- y / 1000

I'm going to use a Poisson GLM with a log link to predict y_rate:

### model
pois_mdl <- glm(y_rate ~ x, family=poisson(link="log"))
summary(pois_mdl)

Plot the fit:

### plot
plot(x, y_rate)
lines(x, pois_mdl$fitted.values)

I am surprised that Poisson glm() allows for non-integer values in the dependent variable. Draws from a Poisson distribution are always integers (regardless of the value of the mean parameter). Why doesn't glm() blow up?

amoeba
  • 93,463
  • 28
  • 275
  • 317
William Chiu
  • 614
  • 5
  • 15

2 Answers2

11

I don't know why glm() doesn't blow up. To figure that out, you'll have to unpack all of the underlying code. (In addition, if your only question is how the R code works, this question is off topic here.)

What I can say is that you are not modeling the rates correctly. If you want to model rates instead of counts, you need to include an offset in the model's formula. (There is a nice discussion on CV of what an offset is here: When to use an offset in a Poisson regression?) Using your example the code would be:

pois_mdl2 <- glm(y~x+offset(log(rep(1000,14))), family=poisson(link="log"))

Note that, although the coefficient estimates are the same, the standard errors are quite different:

summary(pois_mdl2)$coefficients
#               Estimate Std. Error   z value      Pr(>|z|)
# (Intercept) -6.5681214 0.25118701 -26.14833 1.029521e-150
# x            0.2565236 0.02203911  11.63947  2.596237e-31
summary(pois_mdl)$coefficients
#               Estimate Std. Error    z value  Pr(>|z|)
# (Intercept) -6.5681214  7.9431516 -0.8268911 0.4082988
# x            0.2565236  0.6969324  0.3680753 0.7128171
gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
5

While I don't recommend looking at the source code for glm for those that wish to preserve their mental health, I looked at the source code to glm. The reason R doesn't blow up seems to be that it just doesn't bother to do the kind of defensive checks it probably should.

The main iteratively re-weighted least squares loop works by using the methods attached to a family object of the appropriate type. In this case, that is poisson:

> poi <- poisson()
> class(poi)
[1] "family"

This family object has all that glm needs to fit the model, for example:

> poi$linkfun(1)
[1] 0
> poi$linkinv(1)
[1] 2.718282

Another, here's the derivative of the inverse link:

> poi$mu.eta(1)
[1] 2.718282

The y data comes in on line 258:

dev <- sum(dev.resids(y, mu, weights))

Unfortunately, dev.resids could not care at all whether y is positive integer valued:

> poi$dev.resid(1.5, 1, 1)
[1] 0.2163953

So I guess R doesn't blow up because it didn't think to blow up.

Matthew Drury
  • 33,314
  • 2
  • 101
  • 132
  • 1
    +1, lol. There was a reason I didn't actually look at the source code myself ;-). – gung - Reinstate Monica Aug 20 '15 at 02:55
  • 2
    The check, if there was one, would most likely occur in `poisson()$initialize`, which only checks that the `y`s are non-negative. Compare it to `binomial()$initialize`, which, along with dealing with the possible one- or two-column specification of `y`, also checks that they are integer-valued. – Mark Aug 20 '15 at 07:20
  • Nice @Mark, I missed that. – Matthew Drury Aug 20 '15 at 14:07