I want to compare different Count Data Models. How does one compute NEGBIN type I and NEGBIN type II in R?
Which model is estimated by glm.nb()?
Thanks in advance
Are you referring to regression models for count data? If yes, I think your best bet is to use the R packages gamlss and gamlss.dist.
The gamlss package contains a gamlss() function which you can use to fit your regression models.
The gamlss.dist package contains two functions, NBI() and NBII(), which implement the distributions you need.
The help files for the NBI() and NBII() functions provide details on the actual parametrizations they use. According to these help files, the corresponding parametrizations are such that the mean and variance of a variable having a negative binomial distribution are given by:
NBI(): mean = $\mu$ and variance = $\mu + \sigma\mu^2$
NBII(): mean = $\mu$ and variance = $\mu + \sigma\mu$
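If I'm reading the help files correctly, the NBI() parametrization corresponds to base R's dnbinom() with size = 1/sigma and the same mu, so that the variance works out to mu + sigma*mu^2. Here is a quick numerical check of that assumption (the parameter values are arbitrary):
library(gamlss.dist)

# Arbitrary parameter values for the check
mu <- 2; sigma <- 0.5; y <- 0:5

# NBI density from gamlss.dist
dNBI(y, mu = mu, sigma = sigma)

# Base R negative binomial with size = 1/sigma and mean mu
# (assumption: this reproduces the NBI parametrization)
dnbinom(y, size = 1/sigma, mu = mu)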
In contrast, the glm.nb() function in the MASS package uses a parametrization for which the mean and variance of a variable having a negative binomial distribution are given by:
mean = $\mu$ and variance = $\mu + \mu^2/\theta$.
If we denote $1/\theta$ by $\sigma$ in the variance parametrization used by glm.nb(), we recover the variance parametrization used by the NBI() function: $\mu + \mu^2/\theta = \mu + \sigma\mu^2$.
Note that, by default, glm.nb() models $\log(\mu)$ and reports the estimated value of $\theta$, whereas gamlss() with the option family=NBI models $\log(\mu)$ and reports the estimated value of $\log(\sigma)$. If you compute the estimated value of $\log(1/\theta)$ from the summary of the glm.nb model, you should recover the estimated value of $\log(\sigma)$ reported in the summary of the model fitted with gamlss() and family=NBI.
To fit a negative binomial regression model within the gamlss framework, you would use R code like the following:
library(gamlss)  # also loads gamlss.dist, which provides the NBI and NBII families

# NBI: variance = mu + sigma * mu^2
model.NBI <- gamlss(y ~ x1 + x2, data = data, family = NBI)

# NBII: variance = mu + sigma * mu
model.NBII <- gamlss(y ~ x1 + x2, data = data, family = NBII)
For glm.nb(), which lives in the MASS package, the model would look like this:
library(MASS)

model.nb <- glm.nb(y ~ x1 + x2, data = data)
There's a comment elsewhere on this forum (see "Why do different negative binomial regression functions produce different coefficients, p-values") which states:
"From memory, for some reason gamlss uses NBI to refer to the distribution commonly called NB2 (quadratic mean-variance relationship), and NBII for the distribution commonly called NB1 (linear mean-variance relationship). glm.nb fits NB2 only. Try using family = NBI in gamlss. – Mark May 11 '17 at 23:50"
This comment is a reminder to read the gamlss documentation carefully, so that you understand exactly which parametrization is used for each of the two versions of the negative binomial distribution (the NBI/NBII naming does not line up with the NB1/NB2 naming that is more common in the literature).
Here's an example using data pulled from https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/ which shows the connection between glm.nb and gamlss with NBI parametrization:
require(foreign)
dat <- read.dta("https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta")
dat <- within(dat, {
  prog <- factor(prog, levels = 1:3,
                 labels = c("General", "Academic", "Vocational"))
  id <- factor(id)
})
#--- glm.nb (from the MASS package)
require(MASS)
m1 <- glm.nb(daysabs ~ math + prog, data = dat)
summary(m1)
#--- gamlss with NBI family
require(gamlss)
require(gamlss.dist)
m1nb1 <- gamlss(daysabs ~ math + prog, data = dat, family=NBI)
summary(m1nb1)
The summary of the glm.nb model for these data is as follows:
> summary(m1)
Call:
glm.nb(formula = daysabs ~ math + prog,
data = dat,
init.theta = 1.032713156,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1547 -1.0192 -0.3694 0.2285 2.5273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.615265 0.197460 13.245 < 0.0000000000000002 ***
math -0.005993 0.002505 -2.392 0.0167 *
progAcademic -0.440760 0.182610 -2.414 0.0158 *
progVocational -1.278651 0.200720 -6.370 0.000000000189 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.0327) family taken to be 1)
Null deviance: 427.54 on 313 degrees of freedom
Residual deviance: 358.52 on 310 degrees of freedom
AIC: 1741.3
Number of Fisher Scoring iterations: 1
Theta: 1.033
Std. Err.: 0.106
2 x log-likelihood: -1731.258
The summary of the gamlss model with the NBI family option is:
> summary(m1nb1)
******************************************************************
Family: c("NBI", "Negative Binomial type I")
Call: gamlss(formula = daysabs ~ math + prog, family = NBI, data = dat)
Fitting method: RS()
------------------------------------------------------------------
Mu link function: log
Mu Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.615322 0.196364 13.319 < 0.0000000000000002 ***
math -0.005994 0.002507 -2.391 0.0174 *
progAcademic -0.440756 0.182579 -2.414 0.0164 *
progVocational -1.278743 0.201968 -6.331 0.000000000855 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
------------------------------------------------------------------
Sigma link function: log
Sigma Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03219 0.10279 -0.313 0.754
------------------------------------------------------------------
No. of observations in the fit: 314
Degrees of Freedom for the fit: 5
Residual Deg. of Freedom: 309
at cycle: 2
Global Deviance: 1731.258
AIC: 1741.258
SBC: 1760.005
******************************************************************
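The mu coefficient estimates in the two summaries are essentially identical. To put them side by side programmatically (a quick sketch; coef() with what = "mu" extracts the mu coefficients from the gamlss fit):
# Compare the mu coefficients from the two fits; they should agree closely
cbind(glm.nb = coef(m1), gamlss.NBI = coef(m1nb1, what = "mu"))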
If you take the estimated theta of 1.0327 reported by glm.nb and compute log(1/theta) = -log(1.0327) = -0.0322, you recover (up to rounding) the log(sigma) intercept of -0.03219 reported by gamlss.
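You can do this comparison directly from the fitted objects; a minimal sketch (m1$theta holds the estimated theta from glm.nb):
# log(1/theta) from glm.nb should match the gamlss sigma intercept (log link)
log(1 / m1$theta)
coef(m1nb1, what = "sigma")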
This Cross Validated answer by Hilbe is also illuminating: "What is theta in a negative binomial regression fitted with R?"
In it, Hilbe explains that glm.nb uses the indirect parametrization for expressing the variance of a variable having a negative binomial distribution as a quadratic function of the mean value of that variable:
variance = $\mu + \mu^2/\theta$.
The direct parametrization corresponding to the same setting would be:
variance = $\mu + \alpha\mu^2$.
Note that Hilbe uses the notation $\alpha$ instead of the notation $\sigma$ used in gamlss. In his notation, $\alpha = 1/\theta$.
In his book Negative Binomial Regression, Joseph Hilbe states that "NB2 is the standard form of negative binomial used to estimate data that are Poisson-overdispersed, and is the form of the model which most statisticians understand by negative binomial. NB2 is typically the first model we turn to when we discover that a Poisson model is overdispersed." (You will need to read Hilbe's book to understand which parametrization he had in mind when making this statement.)