
OSHA keeps records of fatal injuries in deaths per 100,000 full-time-equivalent workers. I'd like to see whether the improvement over time, from 2006 (3.7) to now (2.8), is statistically significant.

My instinct is to model this as a Poisson distribution for a population of a million full-time workers, setting lambda to 28 and the number of events to 37, but I am concerned that this would be invalid, since a Poisson distribution is supposed to describe counts of discrete events rather than rates. Am I setting myself up for invalid results?
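Concretely, the comparison I have in mind would look something like this in R (just a sketch of the idea, treating the rescaled figures as exact counts):

# 37 events observed in a notional population of one million workers
# (= 10 units of 100,000 worker-years), tested against a rate of 2.8 per 100,000
poisson.test(x = 37, T = 10, r = 2.8)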

Scortchi - Reinstate Monica
    Why one million full-time workers? Do you think that's (approximately) the real population size from which the rates were calculated? – Scortchi - Reinstate Monica Jan 13 '17 at 18:46
    You don't have to invent population sizes: the data you need are available from OSHA via the BLS site. One summary is at https://stats.bls.gov/iif/oshwc/cfoi/cfch0014.pdf. – whuber Jan 13 '17 at 19:02
  • I did see the population sizes (I pulled my data from the second slide of your link), but I am interested in the rates; I want to see whether changing safety regulations have had an impact over the past decade, independent of labor participation. I chose one million simply so I could multiply 2.8 and 3.7 by ten to get whole numbers. I guess I hope a Poisson distribution is linear and scalable. – Heath Vincent Jan 13 '17 at 19:19
    2.8 & 3.7 are doubtless rounded anyway - that approach would be quite wrong, as is hopefully clear from @JonB's answer. See also [Poisson Distribution CI - are the limits scalable?](http://stats.stackexchange.com/q/254790/17230) – Scortchi - Reinstate Monica Jan 13 '17 at 20:28
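To see why scaling rounded rates up to an invented population understates the precision, compare the confidence interval implied by 37 events per million with the one implied by the actual 2006 count of roughly 4808 events among about 130 million workers (a quick check in base R; the population figure is back-calculated the same way as in the answer below):

# exact CI for the rate per 100,000, pretending there were 37 events per million
poisson.test(37)$conf.int / 10        # roughly 2.6 to 5.1
# exact CI using the actual count and the approximate population,
# where 4808 / 3.7 ~ 1299.5 hundred-thousands of workers
poisson.test(4808)$conf.int / 1299.5  # roughly 3.6 to 3.8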

1 Answer


You can model it as a rate if you have the number of fatal injuries (diagram #1 in the link provided by whuber) and the total number of workers each year. Because the numbers of workers are not shown, they can be approximated by dividing the number of injuries by the rate per 100,000 (diagram #2) and then multiplying by 100,000; for 2006, for example, 4808 / 3.7 × 100,000 ≈ 130 million workers.

In Poisson regression, rates are modeled using an offset variable. In short, you take the population size for each year (the number of workers each year) into account without estimating a regression coefficient for it. This is brilliantly explained by the user ocram in the first answer to another question: When to use an offset in a Poisson regression? In that example, time is used as the offset variable to model the rate of events per unit of time, but the idea is the same as in this situation.

Now, all we have to do is enter the data from the link, calculate the approximate total number of workers for each year, and then run a Poisson regression model using the log of the total number of workers as the offset variable. Using R notation and output:

# enter number of fatal work injuries 2006-2015 from diagram #1
events <- c(4808, 4613, 4183, 3488, 3651, 3642, 3571, 3635, 3728, 3751)

# enter rates from diagram #2
rates <- c(3.7, 3.5, 3.2, 2.8, 3.0, 2.9, 2.8, 2.8, 2.8, 2.8)

# years 2006-2015, coded 1-10
year <- 1:10

# calculate approximate population size each year
population <- 100000 * events/rates

summary(glm(events ~ offset(log(population)) + year, family=poisson))

Giving the results:

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -10.240924   0.010503 -975.08   <2e-16 ***
year         -0.030207   0.001747  -17.29   <2e-16 ***

As you can see, there is a highly significant effect of time on the rate of fatal injuries. The rate in any given year is estimated to be exp(-0.030207) ≈ 97% of the rate in the previous year. Now, just to check our results, we can project the 2015 rate from the 2006 rate: 3.7 * exp(-0.030207) ^ 9 = 2.82, which is what we expected.
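If you also want the yearly rate ratio with a confidence interval, you can store the fit and exponentiate the coefficient (a small sketch; fit is just a name for the stored model, and the interval is a Wald interval based on the standard error shown above):

# same model as above, stored so quantities can be extracted from it
fit <- glm(events ~ offset(log(population)) + year, family = poisson)

# yearly rate ratio and Wald 95% CI: roughly 0.970 (0.967 to 0.974)
se <- sqrt(vcov(fit)["year", "year"])
exp(coef(fit)["year"])
exp(coef(fit)["year"] + c(-1, 1) * 1.96 * se)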

JonB
    +1, but to be safer, I might fit years 2007-2015 w/ lag1 as a covariate or using an autocorrelation consistent sandwich estimator. – gung - Reinstate Monica Jan 13 '17 at 21:13
    @Gung good suggestions. But all of this is overkill for the question that was asked. Since the totals are in the thousands and the Poisson variance equals its mean, the standard errors are around 60 to 70. The difference between 4808 (the 2006 total) and 3751 (the 2015 total)--whether or not those numbers are adjusted for changes in population--will itself have a standard error of 100 or so, corresponding to a z-score around 10: it's unquestionably significant. Similar considerations show all the improvement occurred in 2006-2009. – whuber Jan 13 '17 at 21:52
  • Thank you gung. I tried your suggestion with lag1, and I have a pretty vague idea about what autocorrelation is, but I don't understand why we need to account for that here. Is it to account for the possibility that the logged association between time and rate is not linear? – JonB Jan 13 '17 at 22:00
  • Autocorrelation means that the data are not truly independent. So that reduces this case to a simpler one you're familiar with: when the data aren't independent the SEs are wrong (typically the CIs are too narrow). A common strategy is to use autocorrelation consistent SEs (sandwich errors), another is to include a lag variable as a covariate. It isn't necessarily clear which strategy is right. In this case, the former makes the SE .006 (still very significant), but adding the lag changes the slope to -.005 & the SE to .004 (nonsignificant). I suspect that's too strong a correction here. – gung - Reinstate Monica Jan 14 '17 at 01:06
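For anyone who wants to try the checks and corrections discussed in these comments, a rough sketch (it assumes the sandwich and lmtest packages, reuses the objects from the answer, and guesses at the exact lag specification, since gung's code isn't shown):

# whuber's back-of-envelope check: under Poisson variation the drop from
# 4808 (2006) to 3751 (2015) has standard error sqrt(4808 + 3751) ~ 93,
# giving a z-score of about 11
(4808 - 3751) / sqrt(4808 + 3751)

library(sandwich)
library(lmtest)

fit <- glm(events ~ offset(log(population)) + year, family = poisson)

# 1) autocorrelation-consistent (HAC) standard errors for the same fit
coeftest(fit, vcov = vcovHAC(fit))

# 2) refit with the previous year's count as a covariate; 2006 is dropped
#    automatically because its lag is NA
lag1 <- c(NA, events[-length(events)])
fit_lag <- glm(events ~ offset(log(population)) + year + log(lag1),
               family = poisson)
summary(fit_lag)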