So I've been doing a pretty basic logistic regression, where 1 = default and 0 = no default. I have a couple of explanatory variables, some of which behave strangely. The model is:

log1 <- glm(target ~ dsti + NEW_customer + masterscale_pd + mne + loan2inc +
              top_up + type_covered + avg_remain_dur,
            data = df, family = binomial(link = 'logit'))
summary(log1)

Coefficients:
                               Estimate   Std. Error z value             Pr(>|z|)    
(Intercept)                62.421450219 13.454315081   4.640           0.00000349 ***
dsti                        2.797621531  0.721706775   3.876             0.000106 ***
NEW_customer1               0.217681547  0.191455352   1.137             0.255545    
masterscale_pd             -0.979473616  0.071318565 -13.734 < 0.0000000000000002 ***
mne                        -6.878542449  1.441856732  -4.771           0.00000184 ***
loan2inc                    0.035920266  0.015238445   2.357             0.018413 *  
top_up                      0.000016906  0.000008054   2.099             0.035807 *  
type_coveredR               1.192252183  0.289277803   4.121           0.00003764 ***
avg_remain_dur              1.302201156  0.333252387   3.908           0.00009324 ***

After looking at the p-values, my alarm went off; the variable NEW_customer was one of the most important ones in previous analyses. It even separates the default ratios really well:

df %>% group_by(NEW_customer) %>% summarise(def = sum(target==1)/n())
# A tibble: 2 x 2
  NEW_customer    def
  <fct>         <dbl>
1 0            0.01
2 1            0.09

That is, 9% of new customers defaulted, whereas only 1.1% of old customers went bad. I also tried a lasso regression via the cv.glmnet function, where NEW_customer came out as the 2nd most important variable (a rough sketch of that fit is below).
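A minimal sketch of such a fit (the exact preprocessing and the choice of lambda here are assumptions, not my original call):

library(glmnet)

# Expand the predictors into a numeric matrix; model.matrix() turns factors
# such as NEW_customer into dummy columns. Drop the intercept column, since
# glmnet adds its own intercept.
X <- model.matrix(target ~ dsti + NEW_customer + masterscale_pd + mne +
                    loan2inc + top_up + type_covered + avg_remain_dur,
                  data = df)[, -1]

# Cross-validated lasso (alpha = 1) logistic regression
cv_fit <- cv.glmnet(X, df$target, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.min")   # coefficients at the CV-optimal lambda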

After some googling, I came across a similar question, and trying a likelihood-ratio test instead yielded quite different p-values:

anova(log1, test = 'LRT')

               Df Deviance Resid. Df Resid. Dev              Pr(>Chi)    
NULL                              18091     2705.4                          
dsti            1   30.396     18090     2675.0       0.0000000352303 ***
NEW_customer    1  105.430     18089     2569.6 < 0.00000000000000022 ***
masterscale_pd  1  312.483     18088     2257.1 < 0.00000000000000022 ***
mne             1   38.566     18087     2218.6       0.0000000005292 ***
loan2inc        1   17.650     18086     2200.9       0.0000265518195 ***
top_up          1    2.421     18085     2198.5               0.11974    
type_covered    1    4.209     18084     2194.3               0.04022 *  
avg_remain_dur  1   16.339     18083     2177.9       0.0000529510572 ***

These p-values look much more reasonable. My dataset has roughly 18 000 usable observations (per the residual df above), so I don't think the issue lies there. So what is the underlying problem that causes the Wald test to fail? I cannot really wrap my head around it. Thanks for the help!
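For reference, marginal likelihood-ratio tests (each term tested as if entered last, so directly comparable to the Wald tests from summary()) can be obtained with drop1(); I have not shown that output here:

# Single-term deletion LRTs: each predictor is tested against the full model
# with only that term removed, so the order of the terms does not matter.
drop1(log1, test = 'LRT')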

    `anova()` gives *sequential* tests (="Type I"), `summary()` gives *marginal* tests (="Type II/III"). Check `cov2cor(vcov(log1))`, you probably have some correlations among the estimates ... – Ben Bolker Nov 04 '19 at 14:44
  • Indeed, I have a strong correlation between the intercept and the `mne` coefficient, which is -0.999. What could that mean? Also, after removing the intercept, I now have two coefficients for `NEW_customer`: `NEW_customer0` and `NEW_customer1`, which is really strange, and I don't understand the presence of `NEW_customer0`. – PK1998 Nov 05 '19 at 07:17
  • That isn't the relevant correlation. Correlations with the intercept just mean that the center of your data is far from zero. I mean that some other parameter or parameters are correlated with `NEW_customer1`. The problem you are having is a pretty fundamental one in regression/linear modeling, whenever predictor variables are non-orthogonal. See https://stats.stackexchange.com/questions/14522/variable-order-and-accounted-variability-in-linear-mixed-effects-modeling for a better discussion. – Ben Bolker Nov 05 '19 at 14:12
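
A minimal sketch of the check Ben Bolker suggests above:

# Correlation matrix of the coefficient estimates; entries near +/-1 that do
# not involve the intercept flag non-orthogonal (correlated) predictors.
round(cov2cor(vcov(log1)), 2)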
