How does glmnet handle overdispersion?

Question

I have a question about how to model text over count data, in particular how could I use the lasso technique to reduce features.

Say I have N online articles and the count of pageviews for each article. I've extracted 1-grams and 2-grams for each article and I wanted to run a regression over the 1,2-grams. Since the features (1,2-grams) are way more than the number of observations, the lasso would be a nice method to reduce the number of features. Also, I've found glmnet is really handy to run lasso analysis.

However, the count number of pageviews are overdispersed (variance > mean), but glmnet doesn't offer quasipoisson (explicitly) or negative binomial but poisson for count data. The solution I've thought of is to log transform the count data (a commonly used method among social scientists) and make the response variable roughly follow a normal distribution. As such, I could possibly model the data with the gaussian family using glmnet.

So my question is: is it appropriate to do so? Or, shall I just use poisson for glmnet in case glmnet handles quasipoisson? Or, is there other R packages handle this situation?

Thank you very much!

score 18 · Accepted Answer · edited Apr 13 '17 at 12:44

Short answer

Overdispersion does not matter when estimating a vector of regression coefficients for the conditional mean in a quasi/poisson model! You will be fine if you forget about the overdispersion here, use glmnet with the poisson family and just focus on whether your cross-validated prediction error is low.

The Qualification follows below.

Poisson, Quasi-Poisson and estimating functions:

I say the above because overdispersion (OD) in a poisson or quasi-poisson model influences anything to do with the dispersion (or variance or scale or heterogeneity or spread or whatever you want to call it) and as such has an effect on the standard errors and confidence intervals but leaves the estimates for the conditional mean of $y$ (called $\mu$) untouched. This particularly applies to linear decompositions of the mean, like $x^\top\beta$.

This comes from the fact that the estimating equations for the coefficients of the conditional mean are practically the same for both poisson and quasi-poisson models. Quasi-poisson specifies the variance function in terms of the mean and an additional parameter (say $\theta$) as $Var(y)=\theta\mu$ (with for Poisson $\theta$=1), but the $\theta$ does not turn out to be relevant when optimizing the estimating equation. Thus the $\theta$ plays no role in estimating the $\beta$ when conditional mean and variance are proportional. Therefore the point estimates $\hat{\beta}$ are identical for the quasi- and poisson models!

Let me illustrate with an example (notice that one needs to scroll to see the whole code and output) :

> library(MASS)
> data(quine) 
> modp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="poisson")
> modqp <- glm(Days~Age+Sex+Eth+Lrn, data=quine, family="quasipoisson")
> summary(modp)

Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "poisson", 
    data = quine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.808  -3.065  -1.119   1.819   9.909  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.71538    0.06468  41.980  < 2e-16 ***
AgeF1       -0.33390    0.07009  -4.764 1.90e-06 ***
AgeF2        0.25783    0.06242   4.131 3.62e-05 ***
AgeF3        0.42769    0.06769   6.319 2.64e-10 ***
SexM         0.16160    0.04253   3.799 0.000145 ***
EthN        -0.53360    0.04188 -12.740  < 2e-16 ***
LrnSL        0.34894    0.05204   6.705 2.02e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: 2299.2

Number of Fisher Scoring iterations: 5

> summary(modqp)

Call:
glm(formula = Days ~ Age + Sex + Eth + Lrn, family = "quasipoisson", 
    data = quine)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-6.808  -3.065  -1.119   1.819   9.909  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.7154     0.2347  11.569  < 2e-16 ***
AgeF1        -0.3339     0.2543  -1.313 0.191413    
AgeF2         0.2578     0.2265   1.138 0.256938    
AgeF3         0.4277     0.2456   1.741 0.083831 .  
SexM          0.1616     0.1543   1.047 0.296914    
EthN         -0.5336     0.1520  -3.511 0.000602 ***
LrnSL         0.3489     0.1888   1.848 0.066760 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for quasipoisson family taken to be 13.16691)

    Null deviance: 2073.5  on 145  degrees of freedom
Residual deviance: 1696.7  on 139  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

As you can see even though we have strong overdispersion of 12.21 in this data set (by deviance(modp)/modp$df.residual) the regression coefficients (point estimates) do not change at all. But notice how the standard errors change.

The question of the effect of overdispersion in penalized poisson models

Penalized models are mostly used for prediction and variable selection and not (yet) for inference. So people who use these models are interested in the regression parameters for the conditional mean, just shrunk towards zero. If the penalization is the same, the estimating equations for the conditional means derived from the penalized (quasi-)likelihood does also not depend on $\theta$ and therefore overdispersion does not matter for the estimates of $\beta$ in a model of the type:

$g(\mu)=x^\top\beta + f(\beta)$

as $\beta$ is estimated the same way for any variance function of the form $\theta \mu$, so again for all models where conditional mean and variance are proportional. This is just like in unpenalized poisson/quasipoisson models.

If you don't want to take this at face value and avoid the math, you can find empirical support in the fact that in glmnet, if you set the regularization parameter towards 0 (and thus $f(\beta)=0$) you end up pretty much where the poisson and quasipoisson models land (see the last column below where lambda is 0.005).

> library(glmnet)
> y <- quine[,5]
> x <- model.matrix(~Age+Sex+Eth+Lrn,quine)
> modl <- glmnet(y=y,x=x, lambda=c(0.05,0.02,0.01,0.005), family="poisson")
> coefficients(modl)
8 x 4 sparse Matrix of class "dgCMatrix"
                    s0         s1         s2         s3
(Intercept)  2.7320435  2.7221245  2.7188884  2.7172098
(Intercept)  .          .          .          .        
AgeF1       -0.3325689 -0.3335226 -0.3339580 -0.3340520
AgeF2        0.2496120  0.2544253  0.2559408  0.2567880
AgeF3        0.4079635  0.4197509  0.4236024  0.4255759
SexM         0.1530040  0.1581563  0.1598595  0.1607162
EthN        -0.5275619 -0.5311830 -0.5323936 -0.5329969
LrnSL        0.3336885  0.3428815  0.3459650  0.3474745

So what does OD do to penalized regression models? As you may know, there is still some debate about the proper way to calculate standard errors for penalized models (see e.g., here ) and glmnet is not outputting any anyway, probably for that reason. It may very well be that the OD would influence the inference part of the model, just as it does in the non-penalized case but unless some consensus regarding inference in this case is reached, we won't know.

As an aside, one can leave all this messiness behind if one is willing to adopt a Bayesian view where penalized models are just standard models with a specific prior.

@Mono, thank you for your very detailed explanation! Here's my understanding, and please correct me if I'm wrong: `poisson` and `quasipoisson` regressions estimate coefficients in the same way and what they differ is how they estimate standard errors and thus significance. However, for the lasso method, how to calculate standard errors is yet to reach a consensus, and hence its current use mostly lies in variable selection rather than inference. As such, it doesn't matter if we use `glmnet` with poisson or quasipoisson, but what does is that cross-validated error should be minimized. — Sonya S., Jul 13 '14 at 16:54
@Mono, another note, I ran `summary(modqp)` myself and saw it having the exactly same coefficient estimations. I believe your answer will benefit more people on this issue because I hadn't found any, so I suggest you add the output of summary(modqp) for an even better illustrated example. Again, many thanks! — Sonya S., Jul 13 '14 at 16:58
@Sonya Yours is a good summary. The key is that when estimating the parameters for the conditional mean, the estimating functions (say the score function) for poisson and quasipoisson are the same! It therefore does not matter for these parameters whether there is a penalization or not as long as it is the same penalization. I make that more clear above. Thanks also for the pointer regarding summary(modq), but that is already there, it just gets "boxed" on a normal screen, so one has to scroll down. — Momo, Jul 15 '14 at 15:15
I still wonder if it's possible that fewer variables get shrunken in Poisson than if there were a quasi-Poisson specification, which is more correct, and would likely lead to better predictive accuracy than the Poisson model because its sampling model is more correct. — Brash Equilibrium, Jul 18 '14 at 17:13
On that note, it could also be that more variables get shrunken in Poisson than should be shrunken in cases of UNDER-dispersion (such as when you are using a robust Poisson model to estimate relative risk ratios for 0/1 data). — Brash Equilibrium, Jul 18 '14 at 17:15
I'm not sure what you mean by "should" as elasticnet/LASSO is consistent for variable selection under (E)IC, but all depends on how you define the shrinkage. If you define it as in glmnet as −1/N l(β|X,Y)+shrinkage(β;λ,α) it depends only on β and it does not matter for β̂ whether l(β|X,Y) is the likelihood derived from the Poisson or the equivalent quasi-likelihood. If you define the penalty as shrinkage(β,θ;λ,α) then of course it would matter. It may be better to use another shrinkage, but in glmnet, all info about β is contained in the estimating equations. — Momo, Jul 19 '14 at 00:33

How does glmnet handle overdispersion?

1 Answers1

Linked