Logistic regression: adding a categorical variable makes a huge difference in the Kaggle Titanic problem in R

Question

I have this logistic regression model:

fit = glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare, 
          data=passengers, family=binomial)
summary(fit) 

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  5.318162   0.571693   9.302  < 2e-16 ***
Pclass      -1.175648   0.145979  -8.054 8.04e-16 ***
Sexmale     -2.760823   0.199952 -13.807  < 2e-16 ***
Age         -0.043866   0.008220  -5.336 9.49e-08 ***
SibSp       -0.428252   0.106963  -4.004 6.24e-05 ***
Parch       -0.099051   0.118328  -0.837    0.403    
Fare         0.002587   0.002362   1.095    0.274

So, according to the z values, I can say that the predictors Pclass, Sexmale, Age, and SibSp play a large part in determining whether the dependent variable will be 0 or 1.

Now I change the fit to the following (added Title):

fit = glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Title, 
          data=passengers, family=binomial)
summary(fit)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.458e+01  1.906e+03   0.018  0.98552    
Pclass      -1.108e+00  1.556e-01  -7.123 1.06e-12 ***
Sexmale     -3.180e+01  1.906e+03  -0.017  0.98669    
Age         -3.110e-02  9.574e-03  -3.249  0.00116 ** 
SibSp       -6.079e-01  1.250e-01  -4.862 1.16e-06 ***
Parch       -3.614e-01  1.346e-01  -2.685  0.00726 ** 
Fare         4.168e-03  2.573e-03   1.620  0.10531    
TitleDr     -6.257e-01  1.691e+00  -0.370  0.71142    
TitleLady   -1.605e+01  1.452e+03  -0.011  0.99118    
TitleMaster  2.541e+00  1.552e+00   1.637  0.10163    
TitleMiss   -2.982e+01  1.906e+03  -0.016  0.98752    
TitleMlle   -1.642e+01  2.356e+03  -0.007  0.99444    
TitleMr     -9.856e-01  1.439e+00  -0.685  0.49339    
TitleMrs    -2.904e+01  1.906e+03  -0.015  0.98784    
TitleMs     -1.498e+01  3.064e+03  -0.005  0.99610    
TitleRev    -1.576e+01  9.616e+02  -0.016  0.98692    
TitleSir    -4.003e-01  1.705e+00  -0.235  0.81438

So now my model is much "weaker", in the sense that it has more non-significant predictors, and Sexmale is also "overshadowed" by the categorical Title variable.

Could someone tell me the reason? Also, please help me decide whether the new model is really weaker than the previous one, or whether I am missing something fundamental here.

Why don't you run cross-validation and see which model gives better predictions? – ffriend, Aug 28 '14 at 22:42
I will, but testing the model comes after fitting it, and I want to understand why the scores are so different. Or maybe they don't really matter after all? – SLOBY, Aug 28 '14 at 22:45
As you stated yourself, the variables are not really independent and some of them "shadow" others. Sex itself is a strong predictor, but in the presence of the more fine-grained Title* variables it loses some of its importance. However, many "weaker" variables may give better results than a few "stronger" ones. That's why I proposed evaluating the models first. In general, if variables are highly correlated, the coefficients may not really reflect the importance of the features. – ffriend, Aug 28 '14 at 23:03
The above is a good point. But the use of "weaker" is also incorrect, or rather I don't understand the logic behind its use here. You've provided no measures of model performance, so how are you comparing the models? – charles, Aug 28 '14 at 23:06
Ok, so based on the above 2 models *without* measuring, what exactly can we infer? – SLOBY, Aug 28 '14 at 23:08
(1) I'd look at how the coefficients change. The Greenland (10% to) 20% rule of thumb isn't unreasonable for confounding. (2) It does look like Title and Sexmale are correlated. It is interesting to see how the SE inflates with the addition of Title (essentially you're doing a VIF? There might be a function for this in R). (3) But you can't tell from this whether the addition of Title improves or weakens the model. (4) I'm not sure how big the sample size is; you're adding a few degrees of freedom. – charles, Aug 28 '14 at 23:16
"The Greenland (10% to) 20% rule of thumb isn't unreasonable for confounding." -> I'm sorry, but I didn't get this at all. What did you want to say here? You mean the SE of Sexmale, right? The sample consists of 891 rows, so I'd say it's enough. – SLOBY, Aug 29 '14 at 08:38
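
One way to check the overlap between Title and Sex that the comments describe is a cross-tabulation. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the Kaggle file); on the real data the analogous call would presumably be `table(passengers$Title, passengers$Sex)`. A title whose row has only one non-zero cell fully determines Sex, and many such titles are the kind of redundancy that can inflate the standard error of Sexmale.

```r
# Toy data (hypothetical, NOT the Kaggle file): most titles imply one sex.
toy <- data.frame(
  Title = c("Mr", "Mr", "Mr", "Mrs", "Mrs", "Miss", "Miss",
            "Master", "Dr", "Dr"),
  Sex   = c("male", "male", "male", "female", "female", "female", "female",
            "male", "male", "female")
)

# Rows are titles, columns are sexes.
tab <- table(toy$Title, toy$Sex)
print(tab)

# TRUE for titles that occur with only one sex (Sex is redundant given Title).
single_sex <- rowSums(tab > 0) == 1
print(single_sex)
```

In this toy table only "Dr" appears with both sexes, so within every other title the Sex indicator carries no extra information.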

score 4 · Accepted Answer · edited Apr 13 '17 at 12:44

This is a version of a frequently asked question. The answer is that Title is correlated with the other variables that are already in the model. As a result, the estimates for the variables in the model can change when Title is added. The basic story is discussed in many places on CV, but if you want a generic introduction, it may help you to read my answer here: Is there a difference between 'controlling for' and 'ignoring' other variables in multiple regression?
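
The effect can be reproduced on simulated data. The following is only a sketch under made-up assumptions (the variables `x1`, `x2`, and `y` are invented for illustration and have nothing to do with the Titanic data): `x2` is constructed as a near copy of `x1`, and the standard error of `x1` is compared between the model without and with `x2`.

```r
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)          # x2 is almost a copy of x1
y  <- rbinom(n, 1, plogis(1.5 * x1))   # the outcome depends on x1 only

fit_small <- glm(y ~ x1,      family = binomial)
fit_big   <- glm(y ~ x1 + x2, family = binomial)

# Standard error of the x1 coefficient in each model
se_small <- coef(summary(fit_small))["x1", "Std. Error"]
se_big   <- coef(summary(fit_big))["x1", "Std. Error"]
print(c(se_small = se_small, se_big = se_big))
```

The standard error of `x1` is much larger in the second fit, so its z value shrinks even though `x1` is exactly as related to `y` as before. That is the same pattern as Sexmale once Title enters the model.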

answered Aug 28 '14 at 23:15

gung - Reinstate Monica

Thank you, the answer you linked was very helpful. The indicator of correlation is the fact that adding Title to the model greatly affected the other variables' z scores, right? Still, I'm confused about the z score then. Should I care about it? What should indicate in my case that I _might_ get a better fit by using the second model? – SLOBY Aug 29 '14 at 08:35
If you want to know if the fit is better, take a look at measures of fit for both models. A common measure is the area under the ROC curve (AUC). If you are wondering about the models' out-of-sample performance, you can cross-validate to get estimates of the out-of-sample AUCs. I wouldn't worry too much about the z-scores in either case. – gung - Reinstate Monica Aug 29 '14 at 15:25
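
A minimal base-R sketch of the cross-validated AUC comparison suggested above. Everything here is an assumption made for illustration: the data are simulated (`d`, `x1`, `x2`, `y`), and the helpers `auc()` and `cv_auc()` are hypothetical, not part of any package. The AUC is computed with the rank (Mann-Whitney) formula.

```r
set.seed(1)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x1 + 0.8 * d$x2))

# AUC via the rank (Mann-Whitney) formulation; y is 0/1, p are scores.
auc <- function(y, p) {
  r  <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# k-fold cross-validated AUC; assumes the response column is named y.
cv_auc <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  mean(sapply(seq_len(k), function(i) {
    fit <- glm(formula, data = data[folds != i, ], family = binomial)
    p   <- predict(fit, newdata = data[folds == i, ], type = "response")
    auc(data$y[folds == i], p)
  }))
}

auc_small <- cv_auc(y ~ x1, d)
auc_big   <- cv_auc(y ~ x1 + x2, d)
print(c(auc_small = auc_small, auc_big = auc_big))
```

On the Titanic data the two formulas from the question would take the place of `y ~ x1` and `y ~ x1 + x2`. One caveat with a factor such as Title: a rare level may be absent from a training fold, in which case `predict()` fails on the held-out rows, so rare titles may need to be merged first.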
