stratified binary predictor in logistic regression

Question

As part of an assignment, I am working with a dataset that contains a breast-cancer-related dependent variable (HG) and three independent variables representing different risk factors.

head(data)

  NV PI   EH HG
1  0 13 1.64  0
2  0 16 2.26  0
3  0  8 3.14  0
4  0 34 2.68  0
5  0 20 1.28  0
6  0  5 2.31  0

NV is the only categorical predictor. It is strongly associated with HG according to fisher.test, but it has a p-value close to 1 in a logistic regression. This happens both in a model with all three predictors and in a model with NV alone; I'm showing only the single-predictor model here.

model_3 <- glm(formula = HG~NV, data = data, family = 'binomial') 
summary(model_3)

Call:
glm(formula = HG ~ NV, family = "binomial", data = data)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.77180  -0.77180  -0.77180   0.00013   1.64708  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.0586     0.2815  -3.761 0.000169 ***
NV            19.6247  1809.0545   0.011 0.991345    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

The curious thing about NV is that it is 'stratified': It is always 0 whenever HG (the dependent variable) is 0:

table(data[, c(1, 4)])
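
For readers without the dataset, the same pattern can be reproduced with simulated data. This is only a sketch: the data frame `sim` and its counts are made up for illustration and are not the assignment data.

```r
# Hypothetical data (not the assignment dataset): NV is 1 only when HG is 1.
sim <- data.frame(
  HG = rep(c(0, 1), times = c(60, 25)),
  NV = c(rep(0, 60), rep(0, 15), rep(1, 10))
)

# Cross-tabulation: the cell NV = 1, HG = 0 is empty.
tab <- table(NV = sim$NV, HG = sim$HG)
print(tab)

# Fisher's exact test still detects the association.
print(fisher.test(tab)$p.value)

# In the logistic fit the NV estimate and its standard error are both
# very large, so the Wald p-value ends up close to 1.
fit <- glm(HG ~ NV, data = sim, family = "binomial")
print(coef(summary(fit)))
```

As in my real data, the cross-tabulation has an empty cell, fisher.test reports a small p-value, and the glm summary shows a very large standard error for NV.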

Artificially setting NV to 1 for one randomly chosen case where HG is 0 yields a statistically significant p-value in the logistic regression.

data_CH <- data
data_CH$NV[sample(which(data_CH$HG==0), 1)] <- 1
model_CH <- glm(formula = HG~NV, data = data_CH, family = 'binomial')
summary(model_CH)
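
The same effect can be sketched on made-up data. The data frame `sim_CH` and its counts are hypothetical, and the first HG = 0 row is changed instead of a random one so that the result is reproducible.

```r
# Hypothetical data (not the assignment dataset): NV is 1 only when HG is 1.
sim_CH <- data.frame(
  HG = rep(c(0, 1), times = c(60, 25)),
  NV = c(rep(0, 60), rep(0, 15), rep(1, 10))
)

# Set NV to 1 for a single case with HG == 0, so no cell of the
# NV x HG table is empty any more.
sim_CH$NV[which(sim_CH$HG == 0)[1]] <- 1
print(table(NV = sim_CH$NV, HG = sim_CH$HG))

# The NV coefficient is now finite: it equals the log odds ratio
# of the 2 x 2 table, log((10 * 59) / (1 * 15)).
fit_CH <- glm(HG ~ NV, data = sim_CH, family = "binomial")
print(coef(summary(fit_CH)))
```

With a single case in the previously empty cell, the estimate and its standard error come out at ordinary magnitudes, and the Wald p-value for NV falls below 0.05.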

Could someone give me a hint of what's going on here?

It turns out that this is probably a case of 'quasi-complete separation' (https://en.wikipedia.org/wiki/Separation_(statistics)). — PejoPhylo, Apr 10 '19 at 07:56
Do you expect that `NV==0` whenever `HG==0` is simply an artifact of your sample, or is it true in general for the population? See [here](https://stats.stackexchange.com/questions/11109/how-to-deal-with-perfect-separation-in-logistic-regression) for strategies to handle this. — Robert Long, Apr 10 '19 at 08:08
I do not know enough about the biology behind this dataset to give an answer to your first question. Do you have an idea why I didn't see the warning mentioned in the post you linked to? — PejoPhylo, Apr 10 '19 at 08:15
