I'm fitting a logistic regression in R to survey data from 613 students. I want to see whether there is an association between my dependent variable ('BinaryShelter', coded 0 or 1, indicating whether a student took shelter during a tornado warning) and my 5 independent/predictor variables. Three of the IVs are categorical, with anywhere from 3 to 11 distinct levels each; the other two are binary, coded 0 or 1. The first 10 surveys and the R output are given below:
Survey  KSCat  WSCat  PlanHome  PlanWork  KLNKVulCat  BinaryShelter
1       J      B      1         1         A           1
2       A      B      1         0         NA          1
3       B      B      1         1         C           1
4       B      D      1         1         A           0
5       B      D      1         1         A           1
6       G      E      1         1         A           0
7       A      A      1         1         B           1
8       C      F      NA        1         C           0
9       B      B      1         1         A           1
10      C      B      0         0         NA          1
Call:
glm(formula = BinaryShelter ~ KSCat + WSCat + PlanHome + PlanWork +
    KLNKVulCat, family = binomial("logit"), data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0583  -1.3564   0.7654   0.8475   1.6161

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.98471    0.43416   2.268   0.0233 *
KSCatB       -0.63288    0.34599  -1.829   0.0674 .
KSCatC       -0.14549    0.27880  -0.522   0.6018
KSCatD        0.59855    1.12845   0.530   0.5958
KSCatE       15.02995 1028.08167   0.015   0.9883
KSCatF        0.61015    0.68399   0.892   0.3724
KSCatG       -1.60723    1.54174  -1.042   0.2972
KSCatH       -1.57777    1.26621  -1.246   0.2127
KSCatI       -2.06763    1.18469  -1.745   0.0809 .
KSCatJ       -0.23560    0.65723  -0.358   0.7200
WSCatB       -0.30231    0.28752  -1.051   0.2931
WSCatC       -0.49467    1.26400  -0.391   0.6955
WSCatD        0.52501    0.71082   0.739   0.4601
WSCatE       -0.32153    0.63091  -0.510   0.6103
WSCatF       -0.51699    0.74680  -0.692   0.4888
WSCatG       -0.64820    0.39537  -1.639   0.1011
WSCatH       -0.05866    0.89820  -0.065   0.9479
WSCatI      -17.07156 1455.39758  -0.012   0.9906
WSCatJ      -16.31078  662.38939  -0.025   0.9804
PlanHome      0.27095    0.28121   0.964   0.3353
PlanWork      0.24983    0.24190   1.033   0.3017
KLNKVulCatB   0.17280    0.42353   0.408   0.6833
KLNKVulCatC  -0.12551    0.24777  -0.507   0.6125
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 534.16 on 432 degrees of freedom
Residual deviance: 502.31 on 410 degrees of freedom
  (180 observations deleted due to missingness)
AIC: 548.31

Number of Fisher Scoring iterations: 14
> Anova(ShelterYorN, Test = "LR")
Analysis of Deviance Table (Type II tests)

Response: BinaryShelter
           LR Chisq Df Pr(>Chisq)
KSCat       13.3351  9     0.1480
WSCat       14.3789  9     0.1095
PlanHome     0.9160  1     0.3385
PlanWork     1.0583  1     0.3036
KLNKVulCat   0.7145  2     0.6996
My questions are:
1) Does a very large standard error (like the one for KSCatE) indicate that I should drop that level of the categorical IV if I want the model to fit the data better? The levels with such large standard errors came from small groups. Should I exclude data from very small groups? For instance, if only 2 or 3 people picked category 'E' for KSCat, should I exclude those observations?
2) When I code my categorical variables as factors, or when I add more than one IV, my beta coefficients sometimes flip signs. Does this mean I should test for interactions and then try some form of PCA, or should I jump straight to a PCA?
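For question 1, a minimal sketch of how the group sizes could be checked, in case the huge standard errors coincide with levels that are very small or where everyone gave the same answer. This uses only the first 10 surveys listed above as toy data (the names `toy`, `counts` and `one_sided` are just for illustration); on the real data the same cross-tabulation would presumably be run on `mydata`.

```r
# Toy data: the first 10 surveys shown above (NOT the full 613-row data set).
toy <- data.frame(
  KSCat         = c("J", "A", "B", "B", "B", "G", "A", "C", "B", "C"),
  BinaryShelter = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 1)
)

# Cross-tabulate each KSCat level against the outcome.
counts <- table(toy$KSCat, toy$BinaryShelter)
print(counts)

# Levels in which only one outcome value (all 0 or all 1) was observed.
one_sided <- rownames(counts)[rowSums(counts > 0) == 1]
print(one_sided)
```

Levels that show up in `one_sided` on the full data would be the ones I would look at first when comparing against the coefficients with very large standard errors.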
These next questions may be better suited to Stack Overflow, but I figured I'd give it a shot here:
3) I do not want a particular level of the categorical variables to be the reference level. I know that R automatically picks the reference level (A if the levels are letters, the first one if they are numbers). Following the answer to this question (Significance of categorical predictor in logistic regression), I tried fitting the model without an intercept by adding - 1 to the formula so I could see all coefficients directly. But when I do this, the output shows the 'A' level only for the first variable and not for the others: I can see a coefficient for 'KSCatA' but not for 'WSCatA' or 'KLNKVulCatA'.
4) How does R handle missing observations in logistic regression? For example, survey #10 is missing the 'KLNKVulCat' variable but none of the other IVs. Would R (or any other statistical software) drop this person from the model entirely, or just ignore that one variable for them?
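To make questions 3 and 4 concrete, here is a small sketch built only from the first 10 surveys listed above (the data frame `toy` and the objects `cols` and `mf` are illustrative names, not part of my real script). The first part reproduces what I see after adding - 1 to the formula, by looking at the design-matrix columns. The second part assumes the model is fitted with `na.action = na.omit`, which is R's factory-default setting and seems consistent with the "(180 observations deleted due to missingness)" line in my output.

```r
# Toy data: the first 10 surveys shown above (NOT the full 613-row data set).
toy <- data.frame(
  KSCat         = factor(c("J", "A", "B", "B", "B", "G", "A", "C", "B", "C")),
  WSCat         = factor(c("B", "B", "B", "D", "D", "E", "A", "F", "B", "B")),
  PlanHome      = c(1, 1, 1, 1, 1, 1, 1, NA, 1, 0),
  PlanWork      = c(1, 0, 1, 1, 1, 1, 1, 1, 1, 0),
  KLNKVulCat    = factor(c("A", NA, "C", "A", "A", "A", "B", "C", "A", NA)),
  BinaryShelter = c(1, 1, 1, 0, 1, 0, 1, 0, 1, 1)
)

# Question 3: design-matrix columns when the intercept is removed with "- 1".
# Only the first factor in the formula keeps a column for its 'A' level.
cols <- colnames(model.matrix(~ KSCat + WSCat - 1, data = toy))
print(cols)

# Question 4: rows that survive when incomplete cases are omitted
# (na.omit is passed explicitly here; it is also the factory default).
mf <- model.frame(
  BinaryShelter ~ KSCat + WSCat + PlanHome + PlanWork + KLNKVulCat,
  data = toy, na.action = na.omit
)
print(nrow(mf))      # number of rows kept
print(rownames(mf))  # which surveys were kept
```

On this toy data the column list contains 'KSCatA' but starts at 'WSCatB' for the second factor, and only 7 of the 10 surveys are kept: surveys 2, 8 and 10 are dropped in full because each is missing one variable.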
Any help is greatly appreciated, thank you.