What do p-values for levels of a categorical variable represent in Poisson regression?

Question

I have a Poisson model with varying densities:

set.seed(1)
df = data.frame(density = 1:5, events = rpois(2000, 1:5))

If I regress on this, I get that the intercept is approximately log(3), which makes sense because 3 is the mean of 1:5.

glm(events ~ 1, df, family = poisson)  # returns 1.089

But now suppose I want to read back the coefficients of density:

glm(events ~ as.factor(density), df, family = poisson)

(For simplicity I've used density as both the ID of the field and its density.) I would expect the coefficient of density[i] to be log(3-i) because the intercept would still be 3. However, it doesn't seem like the intercept remains 3 - in this case, the intercept is set to log(1). In playing around with this, it seems like glm sets the intercept to be the coefficient of the first factor.

Now I'm starting to wonder what the p values in a glm regression indicate. Is the null hypothesis that density[i] is the same as the intercept (aka density[1])? Or is it that density[i] = mean(density)?

Maybe [this answer](http://stats.stackexchange.com/a/60821/21054) will help. In short, the coefficients for the categories of `density` are the difference of each class to the reference class (which is `density1`, in this case). The $p$-values are for the hypothesis that the pairwise differences are 0 vs. nonzero. The coefficient of `density1` is simply the intercept, as you already noted. — COOLSerdash, Oct 03 '13 at 17:58

score 6 · Accepted Answer · edited Oct 04 '13 at 08:47

The model coefficients are estimated contrasts, determined by how R codes contrasts among the factor levels for density. Take a look at this:

fit <- glm(events ~ as.factor(density), df, family = poisson)

model.matrix(fit)

Storing the GLM as an object in the workspace lets you inspect how these contrasts are coded. The intercept in this case is now the log of the mean event rate when density is equal to 1 (approximately log(1), i.e. close to 0). Each of the remaining parameters, such as the first, which is labeled as.factor(density)2, is the log relative rate comparing events when density is equal to 2 versus density equal to 1.
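A minimal sketch of that interpretation, reusing the simulated data from the question: because the model has one parameter per factor level, the fitted rates should reproduce the observed group means, so the coefficients can be compared directly with logs of those means. The name `groupMeans` is introduced here for illustration only.

```r
set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))
fit <- glm(events ~ as.factor(density), data = df, family = poisson)

# Observed mean event count within each density level
groupMeans <- tapply(df$events, df$density, mean)

# Intercept versus the log mean at the reference level (density == 1)
c(intercept = unname(coef(fit)[1]), logMean1 = unname(log(groupMeans[1])))

# Remaining coefficients versus the log rate ratio of each level to level 1
cbind(coefficient  = coef(fit)[-1],
      logRateRatio = log(groupMeans[-1] / groupMeans[1]))
```

The two columns should agree up to the convergence tolerance of the fitting algorithm.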

Each of these model parameter estimates has a known limiting distribution due to the central limit theorem. The theory is well understood, but a bit advanced; consult McCullagh & Nelder, "Generalized Linear Models", for a statement of the result. Basically, as with linear regression, the estimates of the natural parameters in generalized linear models converge to a normal distribution under replications of the study. Thus, we can calculate the limiting distribution under the null hypothesis and calculate the probability of observing model coefficients as inconsistent or more inconsistent with that hypothesis than what was experimentally obtained. This is very similar to the usual interpretation of a $p$-value as obtained from OLS model parameters, or simple Pearson tests of contingency tables, or the t-test.
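As a rough illustration of that normal approximation, the p-values reported by `summary()` can be rebuilt by hand from the estimates and their asymptotic standard errors. This is a sketch on the question's simulated data; the names `est`, `se`, `z` and `p` are just illustrative.

```r
set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))
fit <- glm(events ~ as.factor(density), data = df, family = poisson)

est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))   # asymptotic standard errors
z   <- est / se                # Wald statistic for H0: coefficient = 0
p   <- 2 * pnorm(-abs(z))      # two-sided p-value from the normal limit

# Hand-computed p-values next to the ones summary() reports
cbind(byHand = p, fromSummary = coef(summary(fit))[, "Pr(>|z|)"])
```

Each row tests a single coefficient against zero, so with the default coding the rows for levels 2 to 5 are pairwise comparisons with the reference level.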

Note that, had you removed the as.factor coding of density, you would have estimated an averaged log relative rate comparing values of density differing by 1 unit, and the intercept would have been extrapolated to be the log event rate when density=0, which may or may not be a useful quantity. The log relative rates in the data you generated are not constant, so the model effects would represent an "averaged effect".

For instance:

## the actual relative rates comparing subsequent density values
relRates <- exp(diff(log(1:5)))

modelFit <- glm(events ~ density, data = df, family = poisson)
## model-based relative rate, weighted by random data
exp(coef(modelFit))[2]
## the approximate average log relative rate, converted to relative rate
exp(mean(log(relRates)))

Is there a way to regress which sets intercept as though I were doing `events ~ 1` and the hypothesis is that the coefficients for the other factors are different from zero? — Xodarap, Oct 03 '13 at 18:46
This is a good answer, @Xodarap. Re: your question in the above comment, you have to think about what it means to be zero; it may help to read my answer as well for that point. — gung - Reinstate Monica, Oct 03 '13 at 19:07
Thanks @gung, AdamO - this helps me understand why glm is doing what it's doing. I'm trying to find a poisson analog of ANOVA - I [had thought](http://stats.stackexchange.com/questions/51208/how-can-i-test-the-likelihood-that-two-poisson-data-sets-are-drawn-from-the-same) this was it, but it seems weird if I have more than two categories. Am I misunderstanding `glm`, or is there something else I should be doing? — Xodarap, Oct 03 '13 at 19:47
@Xodarap yes. The default `contrast` method is `contr.treatment` (try evaluating `contr.treatment(5)` to see exactly what is passed to `model.matrix` for creating the intercept and effects). To get an ANOVA-style contrast (value versus grand mean), you need to pass a `contrasts` argument to your glm call, e.g. `contrasts = list("as.factor(density)" = "contr.sum")`. — AdamO, Oct 03 '13 at 20:10
I think this is the right way to go about running the Poisson analog of ANOVA. You fit your full model w/ all levels of the factor & a nested model w/o the factor. Then you just need to perform a nested model test. In `R`, this can be done via `anova(model1, model2)`. — gung - Reinstate Monica, Oct 03 '13 at 20:15
@gung I echo this suggestion. It's important to realize what the model parameters are testing, but the summary from `coef(summary(modelFit))` does not suffice to simultaneously test for multiple dependent effects. Fitting two (nested) models provides a way to do this. There is also the library `lmtest`, which allows the user to use either likelihood ratio or Wald tests separately, although no score test has been implemented, which is unfortunate. — AdamO, Oct 04 '13 at 18:26
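
The nested-model test suggested in these comments might look like the sketch below, again on the question's simulated data. The object names `fitNull`, `fitFull`, `lrt` and `devDrop` are made up for illustration.

```r
set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))

fitNull <- glm(events ~ 1, data = df, family = poisson)                   # no factor
fitFull <- glm(events ~ as.factor(density), data = df, family = poisson)  # all levels

# Likelihood ratio (deviance) test of the factor as a whole
lrt <- anova(fitNull, fitFull, test = "Chisq")
lrt

# The same test by hand: the drop in deviance is referred to a chi-squared
# distribution with 4 degrees of freedom (5 levels minus 1)
devDrop <- deviance(fitNull) - deviance(fitFull)
pchisq(devDrop, df = 4, lower.tail = FALSE)
```

This gives a single p-value for the hypothesis that all five levels share one rate, which is the closest analog to the ANOVA F test.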

score 4 · Answer 2 · edited Apr 13 '17 at 12:44

R, by default, uses reference cell coding (which I explain here: regression-based-for-example-on-days-of-week). Note that this is called using "treatment contrasts" in R. (Many types of coding schemes are described at UCLA's stats help site.)
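
To make the coding schemes concrete, here is a small sketch comparing the default treatment coding with sum-to-zero coding (`contr.sum`), which is one possible choice for comparing each level with an overall average. The column `fdensity` and the object `fitSum` are hypothetical helpers, and the data are simulated as in the question.

```r
# Coding matrices for a five-level factor
contr.treatment(5)   # default: each level versus the first (reference) level
contr.sum(5)         # sum-to-zero: each level versus the average over levels

set.seed(1)
df <- data.frame(density = 1:5, events = rpois(2000, 1:5))
df$fdensity <- factor(df$density)   # hypothetical helper column

fitSum <- glm(events ~ fdensity, data = df, family = poisson,
              contrasts = list(fdensity = "contr.sum"))

# Under sum-to-zero coding the intercept is the mean of the log group rates
groupMeans <- tapply(df$events, df$density, mean)
c(intercept = unname(coef(fitSum)[1]), meanLogRate = mean(log(groupMeans)))
```

Note that this intercept is the average of the log rates, so it is not the same number as the intercept from `events ~ 1`, which is the log of the overall mean.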

As @COOLSerdash states, the p-values for your indicated factor levels are testing the null hypothesis that their intercepts are equivalent to the intercept of the reference category. The intercept for your reference level is simply called (Intercept) in the output. The null hypothesis tested there is that the data come from a population where the true value is $0$. (I am referring to "intercepts" here, whereas you are referring to "means", but remember that if you don't have a continuous variable, i.e. you only have y~1, the intercept = the mean.)

Let's take a quick look at the Poisson model:
$$ \ln(E(Y)) = \ln(\lambda) = \beta_0 + \beta_1X $$
$$ \lambda = \exp(\beta_0 + \beta_1X) $$
Since you have no $X$, it's just $\ln(\lambda) = \beta_0$. If $\beta_0 = 0$, then $\lambda = \exp(0) = 1$. Hence that is what is being tested for the reference category.

We can also look at this in R:

> set.seed(1234)
> y = rpois(10000, 1)
> summary(glm(y~1, family="poisson"))

...
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0009995  0.0099950     0.1     0.92
...
> exp(0.0009995)
[1] 1.001
> mean(y)
[1] 1.001

> y2 = rpois(1000, 2)
> summary(glm(y2~1, family="poisson"))

...
Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.72900    0.02196   33.19   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
...
> exp(0.72900)
[1] 2.073007
> mean(y2)
[1] 2.073
