
Let's say I want to run a regression of mortality on an x variable. I can see 3 different possibilities:

  1. I define a rate: $y = \frac{\#\text{deaths}}{\text{population}}$

and regress

$$y = xb + \epsilon$$

via OLS

  2. I can take log(#deaths) as the y (or the inverse hyperbolic sine, to account for zeros), and regress:

    $$\log(\#\text{deaths}) = xb + \text{population} + \eta$$

    thereby adjusting for population on the right-hand side rather than the left-hand side.

  3. Run a Poisson regression with the # of deaths as the dependent variable. (Just to clarify: for Poisson, the $y$ would be the raw number of deaths, correct? Not log deaths?)

So my question is, is there an accepted/clear cut answer for why one of these methods would be better than the other?
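To make this concrete, here is a rough sketch in Python/statsmodels of the three setups as I have them in mind (just an illustration; `df` is a hypothetical data frame with columns `deaths`, `population`, and `x`):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# 1. OLS on the death rate
df["rate"] = df["deaths"] / df["population"]
m1 = smf.ols("rate ~ x", data=df).fit()

# 2. OLS on inverse-hyperbolic-sine-transformed deaths (handles zero counts),
#    adjusting for population on the right-hand side
m2 = smf.ols("np.arcsinh(deaths) ~ x + population", data=df).fit()

# 3. Poisson regression with the raw number of deaths as the dependent variable
m3 = smf.glm("deaths ~ x", data=df, family=sm.families.Poisson()).fit()
```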

Steve
    One important consideration in thinking about "better than" is whether your inferential target is **individuals**, or **populations**, since the causes of individual incident cases ≠ the causes of population incidence rates. For example, a strictly enforced, and even ruthless isolation intervention for individuals exposed to a terrifyingly contagious and mostly lethal disease will almost certainly harm the sick individuals so 'treated', but at appropriate scales, will reduce the incidence rate in populations implementing such isolation policies. – Alexis Sep 28 '20 at 17:25
  • In your item 2, surely that should be log(population) on the RHS. For your Poisson model the usual thing would be to use a log-link and have log-population as an offset. Many posts already on site discuss these issues. – Glen_b Sep 29 '20 at 09:46

1 Answer


You want to make sure that your analysis properly weights cases and that you capture the error variance appropriately.

The advantage of the Poisson generalized linear model is that it has a chance of doing all that at once. The standard logarithmic link between the linear predictor and the counts means that you don't take the log of the counts yourself. You adjust for the log of the population size in an offset term* that forces its regression coefficient to be exactly one. The results are thus equivalent to an analysis of rate. As the variance of a Poisson distribution must equal its mean, the analysis properly weights samples and takes into account the error variance, if the Poisson distribution is an adequate model.
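For illustration, a minimal sketch of such a model in Python's statsmodels (assuming a data frame `df` with columns `deaths`, `population`, and `x`, as in the question):

```python
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Poisson GLM with the default log link: the raw counts are the outcome, and
# log(population) enters as an offset, i.e. a term whose coefficient is fixed
# at exactly 1, so the fitted model is equivalent to a model for the rate.
poisson_fit = smf.glm(
    "deaths ~ x",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["population"]),
).fit()
print(poisson_fit.summary())
```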

The first option you present would equally weight a population of 10 members having one event and a population of 10,000 members with 1,000 events. That wouldn't make much sense, as the rate estimate from the larger population is much more precise. The second, as written, would model the log of deaths as a linear function of the population size, which I don't think you want if you are essentially modeling rates. Using the log of the population would help, but unless you treat it as an offset you would not be modeling rates (and I'm not immediately sure how well that would work if you used the inverse hyperbolic sine).
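One way to see the role of the offset (same hypothetical `df` as above, sticking with the Poisson family for simplicity): let log(population) enter as an ordinary regressor instead, so its coefficient is estimated freely; unless that estimate happens to be 1, you are not modeling rates.

```python
# log(population) as a free regressor rather than an offset: its coefficient
# is estimated, and only if it were exactly 1 would this reduce to a rate model.
free_fit = smf.glm(
    "deaths ~ x + np.log(population)",
    data=df,
    family=sm.families.Poisson(),
).fit()
print(free_fit.params["np.log(population)"])  # compare with the value 1 imposed by the offset
```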

It probably makes the most sense to try the Poisson model, see if the assumptions underlying the model hold well enough, and if not, use one of the related models such as quasi-Poisson or negative binomial.
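As a sketch of that model-checking step (continuing from the offset fit above; the 1.5 cutoff is arbitrary, purely for illustration):

```python
# Crude overdispersion check: Pearson chi-square over residual degrees of
# freedom should be near 1 if the Poisson variance assumption is adequate.
dispersion = poisson_fit.pearson_chi2 / poisson_fit.df_resid
print("estimated dispersion:", dispersion)

if dispersion > 1.5:  # arbitrary illustrative threshold
    # Quasi-Poisson-style fix: same mean model, standard errors rescaled by
    # the Pearson-based dispersion estimate.
    quasi_fit = smf.glm(
        "deaths ~ x", data=df,
        family=sm.families.Poisson(),
        offset=np.log(df["population"]),
    ).fit(scale="X2")

    # Negative binomial GLM: adds an extra variance parameter (alpha is fixed
    # at 1.0 here for illustration; in practice it would be estimated).
    nb_fit = smf.glm(
        "deaths ~ x", data=df,
        family=sm.families.NegativeBinomial(alpha=1.0),
        offset=np.log(df["population"]),
    ).fit()
```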


*The example in the linked page is for events over time, but the principle holds for events among a population.

EdM