I am interested in generating a p-value for two small groups of count-based data that I believe show overdispersion. For example, the groups might look like this:
library(MASS)
set.seed(106)
# True mean for treated is 100 and control is 10.
counts <- c(rnegbin(4, 100, 0.5), rnegbin(4, 10, 0.5))
condition <- c(rep("treated", 4), rep("control", 4))
example_df <- data.frame(counts, condition)
example_df
  counts condition
       4   treated
       9   treated
     111   treated
      31   treated
      13   control
       1   control
       7   control
       9   control
This is just a toy example; with real data it would be difficult to estimate dispersion from only 8 values, so for my real data I want to scan a wide range of theta values to determine how overdispersed the data would need to be for the p value to rise above the common 0.05 threshold.
My idea was to model the counts using the negative binomial distribution with different theta values.
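Concretely, I was planning something like the following (just a sketch; the grid of theta values and pulling the Wald p value for conditiontreated out of summary() are my own choices):

# Refit the same negative binomial GLM over a grid of fixed theta values and
# record the p value for the treatment effect from each fit.
library(MASS)
thetas <- c(0.1, 0.5, 1, 2, 5, 10, 100)
p_values <- sapply(thetas, function(th) {
  fit <- glm(counts ~ condition, family = negative.binomial(theta = th),
             data = example_df)
  coef(summary(fit))["conditiontreated", "Pr(>|t|)"]
})
data.frame(theta = thetas, p_value = p_values)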
As I understand it, theta is an estimate of dispersion, calculated as:

theta = mu^2 / (variance - mu)

so that the negative binomial variance is variance = mu + mu^2 / theta.
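To check my understanding of this parameterisation, a quick simulation along these lines (my own sketch, not output from my data) should show the empirical variance shrinking toward the mean as theta grows:

# Empirical vs theoretical variance of MASS::rnegbin draws for several theta values.
# Var(Y) = mu + mu^2 / theta, so larger theta should give variance closer to the mean.
library(MASS)
set.seed(1)
mu <- 10
for (theta in c(0.5, 5, 100)) {
  y <- rnegbin(1e5, mu = mu, theta = theta)
  cat("theta =", theta,
      "| empirical var =", round(var(y), 1),
      "| theoretical var =", round(mu + mu^2 / theta, 1), "\n")
}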
This is my issue and where I think I have a mistake in my thinking:
As theta increases, the variance should get closer to the mean. I would therefore expect the regression to show a lower p value, since with less variance the difference between the control and treated means should stand out more clearly. Yet the opposite is observed:
neg_bin_fit <- glm(counts ~ condition, family = negative.binomial(theta = 100),
                   data = example_df)
summary(neg_bin_fit)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.0149     0.9352   2.155   0.0746 .
conditiontreated   1.6422     1.0455   1.571   0.1673
neg_bin_fit <- glm(counts ~ condition, family = negative.binomial(theta = 0.1),
                   data = example_df)
summary(neg_bin_fit)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.0149     0.5122   3.934  0.00768 **
conditiontreated   1.6422     0.7224   2.273  0.06339 .
I may be thinking about this the wrong way: should I model the treated and control counts separately and then use something like anova to compare the results?
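For reference, the anova-style comparison I have in mind would be roughly this (again just a sketch, with theta arbitrarily fixed at 1 in both fits):

# Likelihood ratio style comparison of the intercept-only model against the
# model with condition, with theta held fixed at the same value in both fits.
library(MASS)
fit0 <- glm(counts ~ 1,         family = negative.binomial(theta = 1), data = example_df)
fit1 <- glm(counts ~ condition, family = negative.binomial(theta = 1), data = example_df)
anova(fit0, fit1, test = "Chisq")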
Thank you