As part of an assignment, I was working with a dataset containing a breast-cancer-related dependent variable (HG) and three independent variables representing different risk factors.
head(data)
NV PI EH HG
1 0 13 1.64 0
2 0 16 2.26 0
3 0 8 3.14 0
4 0 34 2.68 0
5 0 20 1.28 0
6 0 5 2.31 0
NV is the only categorical predictor. It is strongly associated with HG according to fisher.test, but it has a p-value close to 1 in a logistic regression (both in a model with all three predictors and in a model with NV alone; I'm showing only the single-predictor model here).
model_3 <- glm(formula = HG~NV, data = data, family = 'binomial')
summary(model_3)
Call:
glm(formula = HG ~ NV, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.77180 -0.77180 -0.77180 0.00013 1.64708
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0586 0.2815 -3.761 0.000169 ***
NV 19.6247 1809.0545 0.011 0.991345
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
The curious thing about NV is that it is 'stratified': NV is always 0 whenever HG (the dependent variable) is 0, so every case with NV = 1 has HG = 1:
table(data[, c(1, 4)])
HG
NV 0 1
0 49 17
1 0 13
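For anyone who wants to reproduce this without my full dataset, here is a minimal sketch that rebuilds only the NV and HG columns from the counts in the table and then runs both the Fisher test and the single-predictor logistic regression. The data frame name `toy` is made up for this example, and the sketch assumes the table covers all 79 rows; PI and EH are left out.

```r
# Rebuild NV and HG from the 2x2 counts shown in the table
# (assumption: these 79 rows are the whole dataset; `toy` is a made-up name).
toy <- data.frame(
  NV = rep(c(0, 0, 1), times = c(49, 17, 13)),
  HG = rep(c(0, 1, 1), times = c(49, 17, 13))
)

# Cross-tabulation; should show the same counts as the table in the question.
tab <- table(toy)
print(tab)

# Fisher's exact test on the 2x2 table.
ft <- fisher.test(tab)
print(ft$p.value)

# Logistic regression with NV as the only predictor.
fit <- glm(HG ~ NV, data = toy, family = binomial)
print(summary(fit)$coefficients)
```

With these counts the intercept equals log(17/49), which is about -1.0586 and matches the summary output shown earlier.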
Artificially setting NV to 1 for one randomly chosen case where HG is 0 yields a statistically significant p-value in the logistic regression:
data_CH <- data
data_CH$NV[sample(which(data_CH$HG==0), 1)] <- 1
model_CH <- glm(formula = HG~NV, data = data_CH, family = 'binomial')
summary(model_CH)
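The same experiment can be sketched on data rebuilt from the NV/HG counts alone. This is only an illustration under the assumption that the table of counts covers every row; the names `toy` and `toy_CH` and the seed are made up for the example. Because all HG = 0 rows are identical in this reduced data, the result does not depend on which row gets sampled.

```r
# Rebuild NV and HG from the 2x2 counts (49, 17, 0, 13); hypothetical names.
toy <- data.frame(
  NV = rep(c(0, 0, 1), times = c(49, 17, 13)),
  HG = rep(c(0, 1, 1), times = c(49, 17, 13))
)

set.seed(1)  # arbitrary seed, only so the sampled row is reproducible

# Flip NV to 1 for a single randomly chosen case with HG == 0.
toy_CH <- toy
toy_CH$NV[sample(which(toy_CH$HG == 0), 1)] <- 1

# The NV = 1, HG = 0 cell now holds one case instead of zero.
tab_CH <- table(toy_CH)
print(tab_CH)

# Refit the single-predictor logistic regression on the modified data.
fit_CH <- glm(HG ~ NV, data = toy_CH, family = binomial)
print(summary(fit_CH)$coefficients)

# For a 2x2 table the NV coefficient is the log odds ratio of the counts.
log_or <- log((13 / 1) / (17 / 48))
print(log_or)
```

Here the NV estimate is finite and equal to the log odds ratio of the modified table, log(13 * 48 / 17), with a standard error of sqrt(1/13 + 1/1 + 1/17 + 1/48).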
Could someone give me a hint of what's going on here?