I am working on a data set (n = 230) with a binary dependent variable (outcome: 0/1) and six categorical independent variables (most with only two levels).
There is a certain degree of multicollinearity between two of the variables, X1 and X6 (ANOVA model comparison shows that a model with X1 performs slightly better than one with X6), and a quasi-complete separation issue with X4 (due to an empty cell).
I first ran a Random Forest model with all variables included (ntree = 5000, mtry = 3). The result was that X1, X2, and X3 are by far the most important predictors, while X4, X5, and X6 seem to have almost no discriminative power (especially X4, whose vimp() value is 0.00). The model seems to be reliable (C = 0.73).
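For reference, that workflow (random forest on all six predictors, then a variable-importance ranking) can be sketched in Python with scikit-learn, using permutation importance as a rough analogue of vimp(). Everything here is invented: the data are synthetic, the seed is arbitrary, and fewer trees are used than the ntree = 5000 in the original fit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 230
X = rng.integers(0, 2, size=(n, 6))              # six binary predictors
# by construction, only the first three columns drive the outcome
logits = 1.2 * X[:, 0] + 0.9 * X[:, 1] + 0.7 * X[:, 2] - 1.4
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# 500 trees here for speed; the original analysis used ntree = 5000, mtry = 3
rf = RandomForestClassifier(n_estimators=500, max_features=3, random_state=0)
rf.fit(X, y)

imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print("importance ranking (column indices, best first):", ranking)
```

An importance near zero for a column (as reported for X4) simply means permuting it barely changes the model's accuracy.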
Question 1: Does it make sense at this point to fit a binary logistic regression on only the most important predictors obtained from the Random Forest model (X1, X2, X3), without even considering the other three?
Question 2: In order to avoid the separation problem with binary logistic regression, would it make sense to get rid of X4? I am quite sure that the empty cell is an artifact of my data set. Moreover, this category as a whole represents only about 3% of the data (the contingency table is a: 140, b: 0, c: 86, d: 6).
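The separation diagnosis itself can be made concrete with a cross-tabulation: quasi-complete separation shows up as a zero cell in the X4-by-outcome table. A small Python sketch with pandas, on invented data chosen so that one level never co-occurs with one outcome (the level names and counts do not reproduce the original table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# invented data: level "b" is never observed together with outcome 1,
# which produces a zero cell in the crosstab (quasi-complete separation)
x4 = np.array(["a"] * 120 + ["b"] * 20 + ["c"] * 80 + ["d"] * 10)
y = np.concatenate([
    rng.integers(0, 2, 120),   # level a: both outcomes occur
    np.zeros(20, dtype=int),   # level b: outcome 1 never occurs
    rng.integers(0, 2, 80),    # level c: both outcomes occur
    rng.integers(0, 2, 10),    # level d: small cell, both outcomes possible
])

tab = pd.crosstab(pd.Series(x4, name="X4"), pd.Series(y, name="outcome"))
empty = tab.eq(0).any(axis=1)
print(tab)
print("levels with an empty cell:", list(tab.index[empty]))
```

A logistic regression fit on data like this would drive the coefficient for the offending level toward infinity, which is why remedies such as dropping or collapsing the level, or penalized (e.g. Firth) logistic regression, are usually discussed in this situation.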