I am interested in generating a p-value for two small groups of count-based data that I believe show overdispersion. For example, the groups might look like this:
library(MASS)
set.seed(106)
# True mean for treated is 100 and control is 10.
counts <- c(rnegbin(4, 100, 0.5), rnegbin(4, 10, 0.5))
condition <- c(rep("treated", 4), rep("control", 4))
example_df <- data.frame(counts, condition)
example_df
  counts condition
       4   treated
       9   treated
     111   treated
      31   treated
      13   control
       1   control
       7   control
       9   control
This is just a toy example; with real data it would be difficult to estimate dispersion from only 8 values, so for my real data I want to scan a wide range of theta values to determine how overdispersed the data would need to be for the p value to rise above the common 0.05 threshold.
My idea was to model the counts using the negative binomial distribution with different theta values.
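Concretely, I was planning something like the following (just a sketch; the grid of theta values and pulling the Wald p value for conditiontreated out of summary() are my own choices):

# Refit the same negative binomial GLM over a grid of fixed theta values and
# record the p value for the treatment effect from each fit.
library(MASS)
thetas <- c(0.1, 0.5, 1, 2, 5, 10, 100)
p_values <- sapply(thetas, function(th) {
  fit <- glm(counts ~ condition, family = negative.binomial(theta = th),
             data = example_df)
  coef(summary(fit))["conditiontreated", "Pr(>|t|)"]
})
data.frame(theta = thetas, p_value = p_values)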
As I understand it, theta is an estimate of dispersion, calculated as:

theta = mu^2 / (variance - mu)

so that the negative binomial variance is variance = mu + mu^2 / theta.
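To check my understanding of this parameterisation, a quick simulation along these lines (my own sketch, not output from my data) should show the empirical variance shrinking toward the mean as theta grows:

# Empirical vs theoretical variance of MASS::rnegbin draws for several theta values.
# Var(Y) = mu + mu^2 / theta, so larger theta should give variance closer to the mean.
library(MASS)
set.seed(1)
mu <- 10
for (theta in c(0.5, 5, 100)) {
  y <- rnegbin(1e5, mu = mu, theta = theta)
  cat("theta =", theta,
      "| empirical var =", round(var(y), 1),
      "| theoretical var =", round(mu + mu^2 / theta, 1), "\n")
}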
This is my issue and where I think I have a mistake in my thinking:
As theta increases, the variance should get closer to the mean. I would therefore expect the regression to show a lower p value, since with less variance the difference between the control and treated means should stand out more clearly. Yet the opposite is observed:
neg_bin_fit <- glm(counts ~ condition, family = negative.binomial(theta = 100),
                   data = example_df)
summary(neg_bin_fit)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.0149     0.9352   2.155   0.0746 .
conditiontreated   1.6422     1.0455   1.571   0.1673
neg_bin_fit <- glm(counts ~ condition, family = negative.binomial(theta = 0.1),
                   data = example_df)
summary(neg_bin_fit)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.0149     0.5122   3.934  0.00768 **
conditiontreated   1.6422     0.7224   2.273  0.06339 .
I may be thinking about this the wrong way: should I model the treated and control counts separately and then use something like anova to compare the results?
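For reference, the anova-style comparison I have in mind would be roughly this (again just a sketch, with theta arbitrarily fixed at 1 in both fits):

# Likelihood ratio style comparison of the intercept-only model against the
# model with condition, with theta held fixed at the same value in both fits.
library(MASS)
fit0 <- glm(counts ~ 1,         family = negative.binomial(theta = 1), data = example_df)
fit1 <- glm(counts ~ condition, family = negative.binomial(theta = 1), data = example_df)
anova(fit0, fit1, test = "Chisq")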
Thank you