So I've been doing a pretty basic logistic regression, where 1 = default and 0 = no default. I have a couple of explanatory variables, some of which behave strangely. The model is:
log1 <- glm(target ~ dsti + NEW_customer + masterscale_pd + mne + loan2inc +
              top_up + type_covered + avg_remain_dur,
            data = df, family = binomial(link = 'logit'))
summary(log1)
Coefficients:
                    Estimate   Std. Error z value             Pr(>|z|)
(Intercept)    62.421450219 13.454315081   4.640           0.00000349 ***
dsti            2.797621531  0.721706775   3.876             0.000106 ***
NEW_customer1   0.217681547  0.191455352   1.137             0.255545
masterscale_pd -0.979473616  0.071318565 -13.734 < 0.0000000000000002 ***
mne            -6.878542449  1.441856732  -4.771           0.00000184 ***
loan2inc        0.035920266  0.015238445   2.357             0.018413 *
top_up          0.000016906  0.000008054   2.099             0.035807 *
type_coveredR   1.192252183  0.289277803   4.121           0.00003764 ***
avg_remain_dur  1.302201156  0.333252387   3.908           0.00009324 ***
After looking at the p-values, my alarm bells went off: the variable NEW_customer was one of the most important ones in previous analyses. It even separates the default ratios really well:
df %>% group_by(NEW_customer) %>% summarise(def = sum(target==1)/n())
# A tibble: 2 x 2
NEW_customer def
<fct> <dbl>
1 0 0.01
2 1 0.09
Meaning, 9% of new customers defaulted, whereas only about 1% of old customers went bad. So I also tried a lasso regression via the cv.glmnet function, where NEW_customer came out as the second most important variable (a rough sketch of that fit is below).
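Roughly, the lasso fit looked like this (reconstructed, so the exact call may differ slightly):

library(glmnet)
# design matrix from the same formula as log1, dropping the intercept column
X <- model.matrix(target ~ dsti + NEW_customer + masterscale_pd + mne +
                    loan2inc + top_up + type_covered + avg_remain_dur,
                  data = df)[, -1]
# cross-validated lasso (alpha = 1) on the binary default outcome
cv_fit <- cv.glmnet(X, df$target, family = 'binomial', alpha = 1)
coef(cv_fit, s = 'lambda.min')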
After some googling, I came across a related question and tried a different test, which yielded quite different p-values:
anova(log1, test = 'LRT')
               Df Deviance Resid. Df Resid. Dev              Pr(>Chi)
NULL                           18091     2705.4
dsti            1   30.396    18090     2675.0       0.0000000352303 ***
NEW_customer    1  105.430    18089     2569.6 < 0.00000000000000022 ***
masterscale_pd  1  312.483    18088     2257.1 < 0.00000000000000022 ***
mne             1   38.566    18087     2218.6       0.0000000005292 ***
loan2inc        1   17.650    18086     2200.9       0.0000265518195 ***
top_up          1    2.421    18085     2198.5               0.11974
type_covered    1    4.209    18084     2194.3               0.04022 *
avg_remain_dur  1   16.339    18083     2177.9       0.0000529510572 ***
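Note that anova() adds the terms sequentially, so the LR tests above depend on the order of the variables, whereas the Wald z-tests in summary() are marginal. For a per-term likelihood-ratio test against the full model, something like this should work:

drop1(log1, test = 'LRT')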
These p-values look much more reasonable. My dataset has roughly 18,000 observations (the residual df above is 18091), so I don't think the issue is sample size. So what is the underlying problem that causes the Wald test to fail here? I cannot really wrap my head around it. Thanks for the help!