
I'm working with a model where we are studying a marker for a disease as well as risk factors, demographic variables, and related conditions.

Marker: marker concentration (continuous)
Age: continuous
Case/Control: disease state
Gender: M/F
ARMS2rs10490924: genetic risk factor
CFHrs1061170: genetic risk factor
CFHrs10737680: genetic risk factor
SKIV2Lrs429608: genetic risk factor
CNV_Either_Eye: related disease, possible precursor
GA_No_CNV_Either_Eye: related disease, possible precursor
AREDS: treatment that works, but may affect marker concentration

Many of the categorical variables are highly correlated.

Here is a heatmap of the p-values from $\chi^2$ tests between pairs of variables:

[Figure: heatmap of pairwise $\chi^2$ test p-values among the categorical variables]
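
For reference, a minimal sketch of how such a heatmap can be computed with base R. This is a reconstruction, not necessarily the exact code used; it assumes the categorical variables are factor columns of the all_master_table data frame shown in the EDIT below:

## Pairwise chi-squared p-values among the categorical predictors.
cat_vars <- c("CaseString", "Gender", "ARMS2rs10490924", "CFHrs1061170",
              "CFHrs10737680", "SKIV2Lrs429608", "CNV_Either_Eye",
              "GA_No_CNV_Either_Eye", "AREDS")

p_mat <- matrix(NA_real_, nrow = length(cat_vars), ncol = length(cat_vars),
                dimnames = list(cat_vars, cat_vars))
for (i in seq_along(cat_vars)) {
  for (j in seq_along(cat_vars)) {
    if (i != j) {
      tab <- table(all_master_table[[cat_vars[i]]],
                   all_master_table[[cat_vars[j]]])
      p_mat[i, j] <- chisq.test(tab)$p.value
    }
  }
}

## Plot the p-value matrix; smaller p-values indicate stronger pairwise
## association. Rowv/Colv = NA suppresses the dendrograms.
heatmap(p_mat, Rowv = NA, Colv = NA, scale = "none")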

Additionally, the AREDS variable is correlated with age (the disease is age-related), and very few control subjects use the treatment (prophylactically) while almost all of the disease subjects do.

I want to perform a linear regression such as:

lm(MarkerConcentration ~ Case + Age + ... <categorical variables> ...)

What should my strategy be for dealing with the collinearity among my variables?

EDIT: Here is the VIF output for the full model:

> fit = lm(Shannon ~ CaseString + Age + Gender + ARMS2rs10490924 + CFHrs1061170 + CFHrs10737680 + SKIV2Lrs429608 + CNV_Either_Eye + GA_No_CNV_Either_Eye + AREDS, data=all_master_table)
> vif(fit)
                         GVIF Df GVIF^(1/(2*Df))
CaseString           6.774346  1        2.602757
Age                  1.356645  1        1.164751
Gender               1.215620  1        1.102552
ARMS2rs10490924      1.397811  2        1.087332
CFHrs1061170         3.489505  2        1.366756
CFHrs10737680        3.390033  2        1.356910
SKIV2Lrs429608       1.268326  2        1.061226
CNV_Either_Eye       5.174471  1        2.274746
GA_No_CNV_Either_Eye 2.740709  1        1.655509
AREDS                2.325022  1        1.524802
  • Are your predictors collinear? Correlation, even quite high, is not synonymous with (multi)collinearity. – ttnphns Jul 29 '19 at 19:20
  • The only way I know of to test for (multi)collinearity in categorical variables is a $\chi^2$ test, e.g. https://stats.stackexchange.com/a/213805/141304. Is there a better way? – abalter Jul 29 '19 at 19:29
  • Did you check for the presence of collinearity using the vif() function in the car package? Just apply this function to your model and see how big the reported VIF values are for your predictors. – Isabella Ghement Jul 29 '19 at 19:41
  • @IsabellaGhement I edited the post to include the output of `vif`. The variable with the highest value is the KEY variable in our study, disease state, so we can't really eliminate it from the regression! Suggestions? – abalter Jul 29 '19 at 22:14
  • Do you mean Case? What happens if you exclude CNV_Either_Eye from the model? Does the VIF for Case go down? – Isabella Ghement Jul 29 '19 at 22:18
  • Yup! Eliminating CNV_Either_Eye makes the VIF for Case drop to 2.3 (see the sketch after these comments). CaseString is just Case coded as AMD/Control instead of 0/1. – abalter Jul 29 '19 at 22:42
  • So, if I take out CNV_Either_Eye I reduce the multicollinearity. But I lose the ability to determine whether that variable individually affects the marker Shannon. And that's just a limitation of my data set. Is that correct? – abalter Jul 29 '19 at 22:45
  • See https://stats.stackexchange.com/questions/70679/which-variance-inflation-factor-should-i-be-using-textgvif-or-textgvif for how to interpret the GVIF for categorical variables in your model for which you will have Df > 1. Also, the reference level you use in your model for these categorical variables may influence the size of GVIF, as explained here: https://statisticalhorizons.com/multicollinearity. – Isabella Ghement Jul 29 '19 at 23:41
  • The article https://journal.r-project.org/archive/2016-2/imdadullah-aslam-altaf.pdf points out that "a high VIF is neither a necessary nor a sufficient measure of multicollinearity". It also mentions that "the remedy [that is, dropping one of the predictors from the model] may be worse than the disease in some situations, because, multicollinearity may prevent the precise estimation of parameters of the regression model." – Isabella Ghement Jul 29 '19 at 23:44
  • The number of pitfalls here is scary! – abalter Jul 30 '19 at 00:29
  • That's exactly what I was thinking (: – Isabella Ghement Jul 30 '19 at 00:41
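
For concreteness, a minimal sketch of the refit discussed in the comment thread above, assuming the car package is loaded and the same all_master_table data frame is used:

## Drop CNV_Either_Eye and recompute the (G)VIFs. Per the comments,
## the GVIF for CaseString drops to roughly 2.3.
library(car)

fit2 <- lm(Shannon ~ CaseString + Age + Gender + ARMS2rs10490924 +
             CFHrs1061170 + CFHrs10737680 + SKIV2Lrs429608 +
             GA_No_CNV_Either_Eye + AREDS,
           data = all_master_table)
vif(fit2)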

0 Answers