Effect of Wald-test and collinearity on Logistic Regression model selection

Question

A researcher is interested in how variables such as GRE (continuous), GPA (continuous) and rank of the undergraduate institution (categorical) affect admission into graduate school. The response variable, admit/don't admit, is binary. The data set is taken from the UCLA stats page.

library(aod)  # provides wald.test(), used below
admisdata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
summary(admisdata)
admisdata$rank <- factor(admisdata$rank)  # treat rank as categorical
mylogit <- glm(admit ~ gre + gpa + rank, family = binomial(link = "logit"), data = admisdata)

My questions follow:

1) In the code below, they test whether the rank3 and rank4 coefficients differ significantly. What follows if the difference is not significant (as is the case here)? Are we better off merging rank3 and rank4, or leaving one out?

l2 <- cbind(0, 0, 0, 0, 1, -1)  # contrast: rank3 coefficient minus rank4 coefficient
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l2)
>Wald test:
>Chi-squared test:
>X2 = 0.29, df = 1, P(> X2) = 0.59

2) In another list of heuristics, it is recommended to look for collinearities by checking the correlation matrix of the estimated coefficients. And it is stated: "If two covariates are highly correlated, do not need both of them in the model". For the given model fit:

cov2cor(vcov(mylogit))

>            (Intercept)          gre         gpa        rank2       rank3       rank4
>(Intercept)   1.0000000 -0.241538075 -0.80278632 -0.234145435 -0.12357608 -0.18775966
>gre          -0.2415381  1.000000000 -0.34207786 -0.004867914  0.04925080  0.02589326
>gpa          -0.8027863 -0.342077858  1.00000000  0.043045375 -0.08263837  0.02573691
>rank2        -0.2341454 -0.004867914  0.04304537  1.000000000  0.63655379  0.53030520
>rank3        -0.1235761  0.049250801 -0.08263837  0.636553788  1.00000000  0.48337703
>rank4        -0.1877597  0.025893262  0.02573691  0.530305204  0.48337703  1.00000000

Among the predictors, the highest inter-coefficient correlation is between rank2 and rank3 (0.64). Does that mean it is better to leave one of them out or merge them? How do we decide what correlation value is large enough to matter?

3) Or should one prioritise comparing the AICs of the different models with/without these categories, instead of the issues listed in 1) and 2)?

First, what's wrong with the full model? What's it for? Does it perform badly under cross-validation? If it ain't broke don't fix it. — Scortchi - Reinstate Monica, Oct 31 '13 at 22:10
@Scortchi, I added a sentence that explains what the model is for. From the links given 1) is carried out before looking at the overall model performance (e.g. AIC) and my understanding is that 2) is a rule of thumb step to check for collinearity. What I am trying to get is whether my understanding is correct and what to do in 1) and 2) for the specific model presented. — Zhubarb, Nov 01 '13 at 08:41

Scortchi - Reinstate Monica · Accepted Answer · 2013-11-01T10:11:49.927

First, if the researcher's interested in how those variables affect admission, doesn't that interest include how much difference there is between third & fourth ranks?
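
To make that concrete, here is a minimal sketch of estimating the rank3 minus rank4 difference with a standard error and interval, rather than only testing it. It uses simulated data (the `sim` data frame and its effect sizes are made up for illustration, not the UCLA data) and base R only. The chi-squared statistic computed by hand should be the same quantity that `wald.test` reports for a single contrast.

```r
# Simulated stand-in for the admissions data (all values are assumptions).
set.seed(1)
n <- 400
sim <- data.frame(
  gre  = round(rnorm(n, 580, 115)),
  gpa  = round(rnorm(n, 3.4, 0.4), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
eta <- -4 + 0.002 * sim$gre + 0.8 * sim$gpa +
  c(0, -0.7, -1.3, -1.5)[as.integer(sim$rank)]
sim$admit <- rbinom(n, 1, plogis(eta))

fit <- glm(admit ~ gre + gpa + rank, family = binomial(link = "logit"), data = sim)

# Contrast picking out (rank3 coefficient) - (rank4 coefficient).
L <- matrix(c(0, 0, 0, 0, 1, -1), nrow = 1)
b <- coef(fit)
V <- vcov(fit)

est <- drop(L %*% b)                      # estimated difference (log-odds scale)
se  <- sqrt(drop(L %*% V %*% t(L)))       # its standard error
ci  <- est + c(-1, 1) * qnorm(0.975) * se # Wald 95% interval

X2 <- (est / se)^2                        # Wald chi-squared statistic, df = 1
p  <- pchisq(X2, df = 1, lower.tail = FALSE)

print(c(estimate = est, se = se, lower = ci[1], upper = ci[2], X2 = X2, p = p))
```

The interval answers "how different are ranks 3 and 4?" directly; a non-significant test only says the data cannot rule out a difference of zero.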

(1) It's not a good idea to merge categories based just on their having similar responses—you're introducing bias into the coefficient estimates. Things like this are done ruefully, when they have to be, to fix problems with the model, not as a matter of course.

(2) Think this through. A high correlation between two predictors makes it difficult to separate their effects on the response. How many values of rank can a single person have? Are you ever going to be interested in predicting admission for someone having both rank=2 & rank=3? [Edit in response to comment: The answer is that a single person can only have one level of a categorical predictor, & you'll never be interested in predicting the response for someone with more than one level, so correlation between the coefficients of the levels is to be expected, & it poses no problem at all. Neither leave one out (which would be equivalent to merging it with the reference level) nor merge them. (This is sometimes called structural multicollinearity to distinguish it from the problematic kind of multicollinearity).]
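
A rough sketch of both points in (2), again on simulated data (the `sim` data frame and every derived column name here are hypothetical). The first part refits the model with a different reference level: the correlations among the rank coefficients change, since each coefficient is a contrast against the shared reference level, while the fitted probabilities should not. The second part checks that dropping the rank2 dummy gives the same fit as merging rank 2 into the reference level.

```r
# Simulated stand-in for the admissions data (all values are assumptions).
set.seed(1)
n <- 400
sim <- data.frame(
  gre  = round(rnorm(n, 580, 115)),
  gpa  = round(rnorm(n, 3.4, 0.4), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
eta <- -4 + 0.002 * sim$gre + 0.8 * sim$gpa +
  c(0, -0.7, -1.3, -1.5)[as.integer(sim$rank)]
sim$admit <- rbinom(n, 1, plogis(eta))

# Same model, two choices of reference level for rank.
fit_ref1 <- glm(admit ~ gre + gpa + rank, family = binomial, data = sim)
sim$rank_ref4 <- relevel(sim$rank, ref = "4")
fit_ref4 <- glm(admit ~ gre + gpa + rank_ref4, family = binomial, data = sim)

# Correlations among the rank coefficients under each coding.
idx1 <- grep("^rank", names(coef(fit_ref1)))
idx4 <- grep("^rank", names(coef(fit_ref4)))
print(round(cov2cor(vcov(fit_ref1))[idx1, idx1], 2))
print(round(cov2cor(vcov(fit_ref4))[idx4, idx4], 2))

# The fitted probabilities do not depend on the reference level.
print(all.equal(fitted(fit_ref1), fitted(fit_ref4), tolerance = 1e-6))

# Leaving out the rank2 dummy ...
sim$is3 <- as.numeric(sim$rank == "3")
sim$is4 <- as.numeric(sim$rank == "4")
fit_drop2 <- glm(admit ~ gre + gpa + is3 + is4, family = binomial, data = sim)

# ... versus merging rank 2 into the reference level (rank 1).
sim$rank12 <- sim$rank
levels(sim$rank12) <- c("1or2", "1or2", "3", "4")
fit_merge12 <- glm(admit ~ gre + gpa + rank12, family = binomial, data = sim)

print(all.equal(unname(coef(fit_drop2)), unname(coef(fit_merge12)), tolerance = 1e-6))
```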

(3) Read the post you linked to carefully. Gung is saying that Akaike's Information Criterion can be useful to decide between a few candidate models "of substantive interest to you"; he's not recommending trawling through all possibilities to find the model with the lowest AIC.
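
If a model with ranks 3 & 4 combined really were a candidate of substantive interest, a comparison of just those two models might look like the sketch below. The data are simulated and the `rank34` column is a hypothetical name; this illustrates the mechanics of `AIC()` on a short list of models, not a recommendation to merge.

```r
# Simulated stand-in for the admissions data (all values are assumptions).
set.seed(1)
n <- 400
sim <- data.frame(
  gre  = round(rnorm(n, 580, 115)),
  gpa  = round(rnorm(n, 3.4, 0.4), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)
eta <- -4 + 0.002 * sim$gre + 0.8 * sim$gpa +
  c(0, -0.7, -1.3, -1.5)[as.integer(sim$rank)]
sim$admit <- rbinom(n, 1, plogis(eta))

# Candidate 1: all four ranks kept separate.
fit_full <- glm(admit ~ gre + gpa + rank, family = binomial, data = sim)

# Candidate 2: ranks 3 and 4 combined into one level.
sim$rank34 <- sim$rank
levels(sim$rank34) <- c("1", "2", "3or4", "3or4")
fit_merged <- glm(admit ~ gre + gpa + rank34, family = binomial, data = sim)

# AIC for the two candidates (lower is better, same data and response).
aic_tab <- AIC(fit_full, fit_merged)
print(aic_tab)

# The merged model is nested in the full one, so a likelihood-ratio
# test on 1 degree of freedom is also available.
lrt <- anova(fit_merged, fit_full, test = "Chisq")
print(lrt)
```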

Thank you, regarding (2) "*Does that mean it is better to leave one of them out or merge them?*". My question on the possibility of 'merging' refers to re-packaging the members of `rank2` and `rank3` to a new category `rank23` that is a union of the former two. So, I did not understand your response on this. — Zhubarb, Nov 01 '13 at 10:02
