4 specific questions about Logistic Regression with Categorical Predictors in R

Question

I'm attempting logistic regression in R on a survey of 613 students. I want to see whether there is an association between my dependent variable ('BinaryShelter', coded 0 or 1, indicating whether a student took shelter during a tornado warning) and my 5 independent/predictor variables. The three categorical IVs have anywhere from 3 to 11 distinct levels; the other two IVs are binary, coded 0 or 1. The first 10 surveys and the R output are given below:

    Survey  KSCat   WSCat   PlanHome    PlanWork    KLNKVulCat  BinaryShelter
    1       J       B       1           1           A           1
    2       A       B       1           0           NA          1
    3       B       B       1           1           C           1
    4       B       D       1           1           A           0
    5       B       D       1           1           A           1
    6       G       E       1           1           A           0
    7       A       A       1           1           B           1
    8       C       F       NA          1           C           0
    9       B       B       1           1           A           1
    10      C       B       0           0           NA          1



    Call:
    glm(formula = BinaryShelter ~ KSCat + WSCat + PlanHome + PlanWork + 
    KLNKVulCat, family = binomial("logit"), data = mydata)

    Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
    -2.0583  -1.3564   0.7654   0.8475   1.6161  

    Coefficients:
                  Estimate   St. Error  z val   Pr(>|z|)  
    (Intercept)    0.98471    0.43416   2.268   0.0233 *
    KSCatB        -0.63288    0.34599  -1.829   0.0674 .
    KSCatC        -0.14549    0.27880  -0.522   0.6018  
    KSCatD         0.59855    1.12845   0.530   0.5958  
    KSCatE        15.02995 1028.08167   0.015   0.9883  
    KSCatF         0.61015    0.68399   0.892   0.3724  
    KSCatG        -1.60723    1.54174  -1.042   0.2972  
    KSCatH        -1.57777    1.26621  -1.246   0.2127  
    KSCatI        -2.06763    1.18469  -1.745   0.0809 .
    KSCatJ        -0.23560    0.65723  -0.358   0.7200  
    WSCatB        -0.30231    0.28752  -1.051   0.2931  
    WSCatC        -0.49467    1.26400  -0.391   0.6955  
    WSCatD         0.52501    0.71082   0.739   0.4601  
    WSCatE        -0.32153    0.63091  -0.510   0.6103  
    WSCatF        -0.51699    0.74680  -0.692   0.4888  
    WSCatG        -0.64820    0.39537  -1.639   0.1011  
    WSCatH        -0.05866    0.89820  -0.065   0.9479  
    WSCatI       -17.07156 1455.39758  -0.012   0.9906  
    WSCatJ       -16.31078  662.38939  -0.025   0.9804  
    PlanHome       0.27095    0.28121   0.964   0.3353  
    PlanWork       0.24983    0.24190   1.033   0.3017  
    KLNKVulCatB    0.17280    0.42353   0.408   0.6833  
    KLNKVulCatC   -0.12551    0.24777  -0.507   0.6125  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 534.16  on 432  degrees of freedom
    Residual deviance: 502.31  on 410  degrees of freedom
      (180 observations deleted due to missingness)
    AIC: 548.31

    Number of Fisher Scoring iterations: 14

    > Anova(ShelterYorN, Test = "LR")
    Analysis of Deviance Table (Type II tests)

    Response: BinaryShelter
              LR Chisq Df Pr(>Chisq)
    KSCat       13.3351  9     0.1480
    WSCat       14.3789  9     0.1095
    PlanHome     0.9160  1     0.3385
    PlanWork     1.0583  1     0.3036
    KLNKVulCat   0.7145  2     0.6996

My questions are:

1) Does a very large standard error (like the one for KSCatE) indicate that I should drop that level of the categorical IV if I want the model to fit the data better? The levels with such large standard errors came from small groups. Should I exclude data from very small groups? For instance, if only 2 or 3 people picked category 'E' for KSCat, should I exclude those responses?

2) When I code my categorical data as factors, or when I add more than one IV, my beta coefficients sometimes flip signs. Does this mean I should test for interaction and then try some form of PCA, or jump straight to PCA?
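For question 1, an estimate near ±15 with a standard error in the hundreds or thousands (KSCatE, WSCatI, WSCatJ above) is often what shows up when everyone in a small level gave the same answer, so a reasonable first check is to cross-tabulate each factor against the response and look for empty cells. A minimal sketch on made-up data (the data frame `toy` and its values are hypothetical, not the survey data):

```r
# Hypothetical toy data: level "E" has only 3 respondents, all with outcome 1
toy <- data.frame(
  KSCat = factor(c(rep("A", 20), rep("B", 20), rep("E", 3))),
  BinaryShelter = c(rep(c(0, 1), 10), rep(c(0, 1, 1, 1), 5), 1, 1, 1)
)

# Cross-tabulate level by outcome; a zero cell flags a level where
# every respondent gave the same answer
counts <- table(toy$KSCat, toy$BinaryShelter)
print(counts)

# Levels with at least one empty outcome cell
sparse_levels <- rownames(counts)[apply(counts == 0, 1, any)]
print(sparse_levels)

# Fitting anyway reproduces the symptom: a very large estimate and
# standard error for the sparse level (glm may also emit a warning)
fit <- glm(BinaryShelter ~ KSCat, family = binomial("logit"), data = toy)
print(summary(fit)$coefficients)
```

If a level of the real data turns out to be this sparse, merging it with a substantively similar level may be an alternative to dropping those respondents; whether that makes sense depends on what the categories mean.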

These next questions may be better asked on Stack Overflow, but I figured I'd give it a shot here:

3) I do not want a particular level of the categorical variables to be the reference level. I know that R automatically picks the reference level (A if letters, and the first one if numbers). As in the answer to this question (Significance of categorical predictor in logistic regression), I tried fitting the model without an intercept by adding - 1 to the formula to see all coefficients directly. But when I do this, the results only show the 'A' level of the first variable and none of the others. For example, I can see results for 'KSCatA' but not 'WSCatA' or 'KLNKVulCatA'.

4) How does R handle missing observations in logistic regression? For example, survey #10 was missing the 'KLNKVulCat' variable but none of the other IVs. Would R (or any other statistical language) discard all of this person's information, or only that one variable?
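Questions 3 and 4 can both be explored on a small example. As far as I can tell, `relevel()` in base R changes which level is the baseline; dropping the intercept with `- 1` only gives the first factor in the formula a coefficient per level, because the remaining factors would otherwise be collinear with it; and `glm()` by default drops every row that has an `NA` in any model variable (the default `na.action` is `na.omit`), which would match the "180 observations deleted due to missingness" line. A sketch with hypothetical data (`toy` and its columns `y`, `f1`, `f2` are made up):

```r
set.seed(1)
# Hypothetical data with two factors and some missing values
toy <- data.frame(
  y  = rbinom(200, 1, 0.6),
  f1 = factor(sample(c("A", "B", "C"), 200, replace = TRUE)),
  f2 = factor(sample(c("A", "B", "C"), 200, replace = TRUE))
)
toy$f2[1:15] <- NA  # 15 respondents skipped the f2 item

# Question 3: make "B" the reference level of f1 instead of "A"
toy$f1 <- relevel(toy$f1, ref = "B")
fit <- glm(y ~ f1 + f2, family = binomial("logit"), data = toy)
print(names(coef(fit)))  # f1 coefficients are now contrasts against level B

# Without an intercept, only the first factor gets one coefficient per level
fit0 <- glm(y ~ f1 + f2 - 1, family = binomial("logit"), data = toy)
print(names(coef(fit0)))

# Question 4: rows with an NA in any model variable are dropped entirely
n_used     <- nobs(fit)
n_complete <- sum(complete.cases(toy[, c("y", "f1", "f2")]))
print(c(used = n_used, complete = n_complete))
```

The row-wise deletion shown here is the default behaviour of `glm()` in general, not something specific to logistic regression.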

Any help is greatly appreciated, thank you.
