A researcher is interested in how variables such as GRE (continuous), GPA (continuous) and rank of the undergraduate institution (categorical) affect admission into graduate school. The response variable, admit/don't admit, is binary. The data set is taken from the UCLA statistics page.
admisdata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
summary(admisdata)
admisdata$rank <- factor(admisdata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, family = binomial(link = "logit"), data = admisdata)
My questions follow:
1) In the code (below), they check whether there is a statistically significant difference between the rank3 and rank4 coefficients. What would the consequence be if the difference is not significant (as below)? Are we better off merging rank3 and rank4, or leaving one out?
library(aod) # provides wald.test()
l2 <- cbind(0, 0, 0, 0, 1, -1) # contrast: rank3 coefficient minus rank4 coefficient
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l2)
>Wald test:
>Chi-squared test:
>X2 = 0.29, df = 1, P(> X2) = 0.59
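For context, the statistic that wald.test reports for a contrast matrix L is (L b)' (L Sigma L')^-1 (L b), compared against a chi-squared distribution with one degree of freedom per row of L. A minimal base-R sketch of that calculation follows; wald_contrast is a hypothetical helper name of mine, not part of aod.

```r
# Hand-rolled Wald test of the linear hypothesis L %*% beta = 0.
# `fit` is any fitted model with coef() and vcov() methods;
# `L` is a contrast matrix with one column per coefficient.
wald_contrast <- function(fit, L) {
  b   <- coef(fit)                 # estimated coefficients
  V   <- vcov(fit)                 # their estimated covariance matrix
  est <- L %*% b                   # estimated value of the contrast(s)
  X2  <- as.numeric(t(est) %*% solve(L %*% V %*% t(L)) %*% est)
  df  <- nrow(L)                   # one degree of freedom per contrast row
  c(X2 = X2, df = df, p = pchisq(X2, df, lower.tail = FALSE))
}
```

If mylogit and l2 from above are in the workspace, wald_contrast(mylogit, l2) should give the same X2 and p-value as the wald.test call.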
2) In another list of heuristics, it is recommended to look for collinearities by checking the correlation matrix of the estimated coefficients. And it is stated: "If two covariates are highly correlated, do not need both of them in the model". For the given model fit:
cov2cor(vcov(mylogit))
>              (Intercept)          gre          gpa        rank2        rank3        rank4
>(Intercept)     1.0000000 -0.241538075  -0.80278632 -0.234145435  -0.12357608  -0.18775966
>gre            -0.2415381  1.000000000  -0.34207786 -0.004867914   0.04925080   0.02589326
>gpa            -0.8027863 -0.342077858   1.00000000  0.043045375  -0.08263837   0.02573691
>rank2          -0.2341454 -0.004867914   0.04304537  1.000000000   0.63655379   0.53030520
>rank3          -0.1235761  0.049250801  -0.08263837  0.636553788   1.00000000   0.48337703
>rank4          -0.1877597  0.025893262   0.02573691  0.530305204   0.48337703   1.00000000
It seems like the highest inter-coefficient correlation is between rank3 and rank2. Does that mean it is better to leave one of them out or merge them? How do we decide what correlation value is significant enough?
3) Or should one prioritise comparing the AICs of the different models, with and without these categories, over the issues listed in 1) and 2)?
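To make 3) concrete, this is roughly the comparison I have in mind: a sketch that refits the model with two rank levels collapsed into one and tabulates both AICs. compare_merged_aic is a hypothetical helper; it assumes a data frame with columns admit, gre, gpa and rank, as in admisdata.

```r
# Compare the AIC of the full model against a model in which the
# rank levels named in `merge` are collapsed into a single level.
compare_merged_aic <- function(data, merge = c("3", "4")) {
  data$rank <- factor(data$rank)
  full <- glm(admit ~ gre + gpa + rank, family = binomial, data = data)

  merged_data <- data
  lv <- levels(merged_data$rank)
  lv[lv %in% merge] <- paste(merge, collapse = "+")  # e.g. "3+4"
  levels(merged_data$rank) <- lv   # duplicated labels collapse the levels

  merged <- glm(admit ~ gre + gpa + rank, family = binomial, data = merged_data)
  AIC(full, merged)                # data frame with df and AIC per model
}
```

With the data loaded as above, compare_merged_aic(admisdata) would return one row per model; the lower AIC is the preferred model under that criterion.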