Why did subsetting versus including a factor change the coefficients of a logistic regression in R?

Question

The coefficients changed a lot between a model that included all levels of a factor and a model fitted to the subset of the data containing only one level of that factor.

I am trying to fit a logistic regression of disease on contact exposure. The data come from several sites, so in the first model (ml1) I included site as a factor. I also wanted to analyze the association at one specific site, WB, so in the second model (ml2) I used that site as a subset of the data.

ml1 <- glm(disease ~ x + factor(site) + factor(anycontact) + factor(comecat), data = gianalysis_bd, family = binomial)
summary(ml1)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -3.44400    0.25761 -13.369  < 2e-16 ***
x                    0.24559    0.08309   2.956 0.003121 **
factor(site)FB      0.03967    0.15177   0.261 0.793792    
factor(site)GB     -0.54896    0.16538  -3.319 0.000902 ***
factor(site)HB      0.39635    0.14699   2.696 0.007010 **
factor(site)SB     -0.13887    0.14347  -0.968 0.333069    
factor(site)WB     -0.06200    0.14647  -0.423 0.672067    
factor(site)WP     -0.03706    0.15388  -0.241 0.809677    
factor(anycontact)1  0.40856    0.06846   5.968 2.41e-09 ***
factor(comecat)2     0.02260    0.07184   0.315 0.753037    
factor(comecat)3     0.11195    0.07574   1.478 0.139405

ml2 <- glm(disease ~ x + factor(anycontact) + factor(comecat), data = gianalysis_bd, subset = site == "WB", family = binomial)
summary(ml2)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -3.4016     0.4347  -7.825 5.06e-15 ***
x                     0.1421     0.1454   0.977  0.32834    
factor(anycontact)1   0.7380     0.2590   2.850  0.00438 **
factor(comecat)2     -0.4049     0.2042  -1.983  0.04738 *  
factor(comecat)3      0.1136     0.2182   0.520  0.60273

However, the coefficient of factor(anycontact)1 changed substantially, increasing from 0.4086 (ml1) to 0.7380 (ml2). I cannot tell why that happened; I expected it to be the same in both models. Can someone explain the difference between the two models and the reason for the change? Thank you very much.

Can you please reword the question to make clear that you are in fact fitting two different models, one specifically for "WB" and another across all sites? — behold, Apr 20 '19 at 15:15

score 2 · Answer 1 · answered Apr 20 '19 at 15:56

Without knowing more about your data it is hard to say precisely what is going on in your case, but here are two possibilities.

First, in any regression model, omitting predictors that are correlated with the included predictors can change the coefficients of the included predictors, and can even go so far as to reverse their signs, as in Simpson's paradox.
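A minimal simulated sketch of this first possibility, not based on the asker's data: the variables `z`, `exposure`, and `y` and all effect sizes below are invented for illustration. The exposure truly raises the log-odds of the outcome, but it is correlated with a strongly protective predictor `z`, so leaving `z` out flips the sign of the fitted exposure coefficient.

```r
set.seed(1)
n <- 20000

# Hypothetical predictor z, strongly correlated with the exposure
z <- rbinom(n, 1, 0.5)
exposure <- rbinom(n, 1, ifelse(z == 1, 0.8, 0.2))

# Assumed truth: exposure raises the log-odds (+0.5), z lowers them (-2)
y <- rbinom(n, 1, plogis(-1 + 0.5 * exposure - 2 * z))

full    <- glm(y ~ exposure + z, family = binomial)
omitted <- glm(y ~ exposure, family = binomial)

coef(full)["exposure"]     # positive, near the simulated effect
coef(omitted)["exposure"]  # negative once z is left out
```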

Second, omitting any predictor related to the outcome in models like logistic or Cox proportional hazards regression can bias the coefficient values, even if the omitted predictor is not correlated with the included predictors. This answer provides an analytic demonstration for a similar approach, probit modeling.
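The second possibility can be sketched the same way, again with made-up data and effect sizes. Here the hypothetical predictor `z` is generated independently of the exposure, yet dropping it still pulls the fitted exposure coefficient toward zero.

```r
set.seed(2)
n <- 50000

exposure <- rbinom(n, 1, 0.5)
z <- rnorm(n, sd = 2)  # generated independently of exposure

# Assumed truth: both exposure and z have a log-odds coefficient of 1
y <- rbinom(n, 1, plogis(-0.5 + 1 * exposure + 1 * z))

with_z    <- glm(y ~ exposure + z, family = binomial)
without_z <- glm(y ~ exposure, family = binomial)

coef(with_z)["exposure"]     # near the simulated value of 1
coef(without_z)["exposure"]  # noticeably smaller, although z is uncorrelated with exposure
```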

In your example, not only did the coefficient for anycontact1 change from the full model when analysis was restricted to the subset, but so did the values and apparent significance of coefficients for x and factor(comecat)2. I suspect that the reasons for these differences lie in some combination of the correlations among these predictors and how they might change between the entire data set and the subset.
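One way to probe this, which the answer itself does not propose, is to let the contact coefficient vary by site with an interaction term. The sketch below uses a simulated stand-in (`sim`) with column names borrowed from the question; the site-specific effects of 0.8 and 0.3 are assumptions chosen for illustration. The pooled model estimates a single contact coefficient shared by all sites, while the subset model estimates the one for WB alone.

```r
set.seed(3)
n <- 30000

# Simulated stand-in for the asker's data; all values are invented
sim <- data.frame(
  site = factor(sample(c("FB", "HB", "WB"), n, replace = TRUE)),
  anycontact = rbinom(n, 1, 0.4)
)

# Assumed truth: the contact effect is 0.8 at WB and 0.3 at the other sites
slope <- ifelse(sim$site == "WB", 0.8, 0.3)
sim$disease <- rbinom(n, 1, plogis(-2 + slope * sim$anycontact))

pooled  <- glm(disease ~ site + factor(anycontact), data = sim, family = binomial)
wb_only <- glm(disease ~ factor(anycontact), data = sim,
               subset = site == "WB", family = binomial)
inter   <- glm(disease ~ site * factor(anycontact), data = sim, family = binomial)

coef(pooled)["factor(anycontact)1"]   # one coefficient shared across sites
coef(wb_only)["factor(anycontact)1"]  # WB-specific coefficient

# The interaction model recovers the WB-specific coefficient from the full data
b <- coef(inter)
wb_from_inter <- unname(b["factor(anycontact)1"] + b["siteWB:factor(anycontact)1"])

# Likelihood-ratio test of whether the contact effect differs by site
anova(pooled, inter, test = "Chisq")
```

In this simplified setup, with no other covariates, the WB effect from the interaction model matches the subset fit. With additional covariates such as x and comecat the two would generally differ unless those were also interacted with site.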

score 0 · Answer 2 · answered Apr 20 '19 at 14:58

I think it makes sense for the site "WB"-specific model to differ from a model for all sites combined.

It looks like, in terms of sites, there are three groups: "HB", "GB" and "Not HB/GB".

Only HB and GB are significant, with low p-values.

I think if you run the regression for "Not HB/GB", it should give you a model similar to the one you fitted only for "WB". Can you try that and post the result?
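A sketch of how that fit could be specified, assuming the "Not HB/GB" sites are pooled into one group with site dropped from the formula (one reading of the suggestion). The data frame `dat` is a simulated stand-in for gianalysis_bd with invented values, and `ml3` is a hypothetical name; with the real data, only the `subset` expression would carry over.

```r
set.seed(4)
n <- 5000

# Simulated stand-in for gianalysis_bd; column names follow the question,
# values and effect sizes are invented
dat <- data.frame(
  x = rnorm(n),
  site = sample(c("FB", "GB", "HB", "SB", "WB", "WP"), n, replace = TRUE),
  anycontact = rbinom(n, 1, 0.4),
  comecat = sample(1:3, n, replace = TRUE)
)
dat$disease <- rbinom(n, 1, plogis(-2 + 0.2 * dat$x + 0.4 * dat$anycontact))

# Same formula as ml2, fitted to every site except HB and GB
ml3 <- glm(disease ~ x + factor(anycontact) + factor(comecat),
           data = dat, subset = !(site %in% c("HB", "GB")), family = binomial)
summary(ml3)$coefficients
```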
