28

(First of all, just to confirm, an offset variable functions basically the same way in Poisson and negative binomial regression, right?)

Reading about the use of an offset variable, it seems that most sources recommend including it as an option in statistical packages (`exp()` in Stata or `offset()` in R). Is that functionally the same as converting your outcome variable to a proportion, if you're modeling count data and there is a finite number of times the event could have occurred? My example looks at employee dismissals, and I believe the offset here would simply be log(number of employees).

And as an added question, I'm having trouble conceptualizing the difference between those first two options (including exposure as an option in the software, or converting the DV to a proportion) and including the exposure on the RHS as a control variable. Any help here would be appreciated.

RickyB
  • Not correct about choosing offset. Usually you will want the offset to be the log(number at risk) or log(person-time), since the link is log(). – DWin Aug 08 '13 at 06:33
  • Sorry, I'm not sure I understand - which part is not correct? Are you saying that the offset chosen should be log([number at risk]) and not just [number at risk]? Oops, I actually did mean to write that in the OP - I'll edit that now. – RickyB Aug 08 '13 at 21:34
  • Just a note to add that Stata can complicate the issue of what is going into the model -- `exp()` requires the pre-calculated log(number at risk) as per @DWin's answer; while `offset()` takes the original (number at risk) as an input, but calculates the log of this for you before using it in the model. So they return the same results, but require different forms of input. – James Stanley Aug 08 '13 at 21:53
  • Does that mean that the function call by Hong below is then incorrect? Should it be offset(Holders) instead? Because his estimated coefficients did stay the same... – RickyB Aug 08 '13 at 22:41
  • @RickyB -- no, because HongOoi's (very good) answer is in R, which asks for `offset()` and requires the log transform of the variable to be passed to the model. [Sorry for additional confusion in my attempt to reduce confusion -- in other words, `offset` in R corresponds to `exp` in Stata from a syntax point of view...] – James Stanley Aug 08 '13 at 22:51
  • A quick comment on using offsets -- these are in most instances vital to having an appropriate model. I tell students that the need for the offset variable is to compare rates of events between groups [or covariate values] rather than comparing the absolute counts of events -- with the former almost always being our question of interest. – James Stanley Aug 08 '13 at 22:56
  • @JamesStanley: Your comments are very helpful. As you might have guessed I am an R (and former GLIM) user. I'm thinking that Stata's authors might have saved us all some confusion by naming the "exp" function "expected". I would have imagined it to be the exponential function (as it is in almost every other language.) – DWin Aug 08 '13 at 23:43
  • @DWin: Some slight idiosyncrasies of Stata come into play in that it allows options to be specified in shortened form -- the full option name is `exposure`, but one can shorten this to `exp` or `e` or anything in between! (And as you'd expect, `exp` used as a simple function, rather than as an option to the `poisson` command here, returns the exponential -- e.g. `disp exp(1)` prints 2.7182818 to screen.) – James Stanley Aug 08 '13 at 23:57
  • Ugh. Partial name matching. (R has some of that, too.) – DWin Aug 08 '13 at 23:59
  • Thank you for your help. And yes...as someone that switches between Stata and R on a regular basis, I get tripped up on these things a lot when I forget in which environment I'm working. – RickyB Aug 09 '13 at 23:21
  • @JamesStanley Sorry to ask more about this, but it seems like your first comment above says that `offset()` in R takes the original number as an input and then takes the log for you, but your second comment says that `offset()` in R requires the log transform of the variable to be passed to the model. Your second comment seems consistent with Hong Ooi's answer. When I try this on my own, R throws an error if I supply the raw number as the offset, but it works if I supply the log(number). Is there any situation where using the raw number as the offset is correct? – gannawag Jan 26 '17 at 19:46
  • @gannawag The analysis itself always requires the log(person-time) value. R requires you to calculate the log yourself -- either previously (i.e. make a new variable), or at the time you call the function (by including log(number) as the argument to `offset()`). – James Stanley Jan 26 '17 at 20:04
  • @gannawag [cont.] My first comment was specifically about Stata options (which OP had noted). The potential confusion is due to similar option/argument names for slightly discordant options/arguments -- Stata's `offset(number)` option is equivalent to calling R with `offset=log(number)`. – James Stanley Jan 26 '17 at 20:11

2 Answers

65

Recall that an offset is just a predictor variable whose coefficient is fixed at 1. So, using the standard setup for a Poisson regression with a log link, we have:

$$\log \mathrm{E}(Y) = \beta' \mathrm{X} + \log \mathcal{E}$$

where $\mathcal{E}$ is the offset/exposure variable. This can be rewritten as

$$\log \mathrm{E}(Y) - \log \mathcal{E} = \beta' \mathrm{X}$$ $$\log \mathrm{E}(Y/\mathcal{E}) = \beta' \mathrm{X}$$

Your underlying random variable is still $Y$, but by dividing by $\mathcal{E}$ we've converted the LHS of the model equation to be a rate of events per unit exposure. But this division also alters the variance of the response, so we have to weight by $\mathcal{E}$ when fitting the model.
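
To make the weighting step explicit (assuming the usual (quasi-)Poisson variance function $\mathrm{Var}(Y) = \phi\,\mathrm{E}(Y)$):

$$\mathrm{Var}(Y/\mathcal{E}) = \frac{\mathrm{Var}(Y)}{\mathcal{E}^2} = \frac{\phi\,\mathrm{E}(Y)}{\mathcal{E}^2} = \frac{\phi}{\mathcal{E}}\,\mathrm{E}(Y/\mathcal{E})$$

so the rate $Y/\mathcal{E}$ behaves like a (quasi-)Poisson response observed with prior weight $\mathcal{E}$, which is why the two fits below give identical coefficients.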

Example in R:

library(MASS) # for Insurance dataset

# modelling the claim rate, with exposure as a weight
# use quasipoisson family to stop glm complaining about nonintegral response
glm(Claims/Holders ~ District + Group + Age,
    family=quasipoisson, data=Insurance, weights=Holders)

Call:  glm(formula = Claims/Holders ~ District + Group + Age, family = quasipoisson, 
    data = Insurance, weights = Holders)

Coefficients:
(Intercept)    District2    District3    District4      Group.L      Group.Q      Group.C        Age.L        Age.Q        Age.C  
  -1.810508     0.025868     0.038524     0.234205     0.429708     0.004632    -0.029294    -0.394432    -0.000355    -0.016737  

Degrees of Freedom: 63 Total (i.e. Null);  54 Residual
Null Deviance:      236.3 
Residual Deviance: 51.42        AIC: NA


# with log-exposure as offset
glm(Claims ~ District + Group + Age + offset(log(Holders)),
    family=poisson, data=Insurance)

Call:  glm(formula = Claims ~ District + Group + Age + offset(log(Holders)), 
    family = poisson, data = Insurance)

Coefficients:
(Intercept)    District2    District3    District4      Group.L      Group.Q      Group.C        Age.L        Age.Q        Age.C  
  -1.810508     0.025868     0.038524     0.234205     0.429708     0.004632    -0.029294    -0.394432    -0.000355    -0.016737  

Degrees of Freedom: 63 Total (i.e. Null);  54 Residual
Null Deviance:      236.3 
Residual Deviance: 51.42        AIC: 388.7
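
As a quick sanity check on the algebra above: with a log link the offset enters additively on the log scale, so adding $\log \mathcal{E}$ is the same as multiplying the fitted mean by the exposure. A minimal numeric check of that identity, in plain Python with arbitrary made-up values (no modeling package involved):

```python
import math

# With a log link, the linear predictor eta = b0 + b1*x + log(n)
# gives mean mu = exp(eta) = n * exp(b0 + b1*x): the expected count
# scales linearly with exposure n, i.e. a constant rate per unit exposure.
b0, b1, x, n = -1.8, 0.4, 2.0, 120.0

mu_with_offset = math.exp(b0 + b1 * x + math.log(n))
mu_scaled_rate = n * math.exp(b0 + b1 * x)

print(math.isclose(mu_with_offset, mu_scaled_rate))  # True
```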
Hong Ooi
4

The offset does act similarly for both Poisson and NB regression. The offset has two functions. For Poisson models, the actual number of events defines the variance, so that is needed. It also provides the denominator, so you can compare rates; the comparison is unit-less.

Just using a ratio will mess up the standard errors. Having a model that deals with the offset, as most Poisson regression functions do, takes care of both the standard errors AND the comparison of rates.

user28926
  • What do you mean by mess up the SEs? – dimitriy Aug 08 '13 at 21:45
  • Is the reasoning here that the ratio will no longer properly fit the Poisson distribution? – RickyB Aug 08 '13 at 21:50
  • Thread necromancy, but for anyone reading this in 2017: I think this means converting to a ratio erases the information from the total number of trials -- 1 out of 10 should have a different SE than 10 out of 100, but they both get converted to 0.1. – Patrick B. Oct 05 '17 at 23:03
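
Patrick B.'s point can be checked directly. For a single Poisson count $y$ with exposure $n$, the ML estimate of the rate is $y/n$ with (plug-in) standard error $\sqrt{y}/n$, so two samples with the same ratio but different exposures carry different precision. A plain-Python sketch, no modeling package required:

```python
import math

def poisson_rate_se(events, exposure):
    """Plug-in SE of the ML rate estimate events/exposure for a Poisson count:
    Var(Y) = E(Y), so Var(Y/n) = Var(Y)/n^2, giving SE = sqrt(events)/exposure."""
    return math.sqrt(events) / exposure

# Same ratio (0.1), different exposures -> different standard errors:
print(1 / 10, poisson_rate_se(1, 10))      # 0.1 0.1
print(10 / 100, poisson_rate_se(10, 100))  # 0.1 0.0316...
```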