I am working on a data set (n = 230) with a binary dependent variable (outcome: 0/1) and six categorical independent variables (most with only two levels).
There is a certain degree of multicollinearity between two of the variables, X1 and X6 (ANOVA model comparison shows that a model with X1 performs slightly better than one with X6), and a quasi-complete separation issue with X4 (due to an empty cell).
I first ran a Random Forest model with all variables included (ntree = 5000, mtry = 3). The result was that X1, X2, and X3 are by far the most important predictors, while X4, X5, and X6 seem to have almost no discriminative power (especially X4, whose vimp() value is 0.00). The model seems to be reliable (C = 0.73).
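For reference, that workflow (random forest on all six predictors, then a variable-importance ranking) can be sketched in Python with scikit-learn, using permutation importance as a rough analogue of vimp(). Everything here is invented: the data are synthetic, the seed is arbitrary, and fewer trees are used than the ntree = 5000 in the original fit:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 230
X = rng.integers(0, 2, size=(n, 6))              # six binary predictors
# by construction, only the first three columns drive the outcome
logits = 1.2 * X[:, 0] + 0.9 * X[:, 1] + 0.7 * X[:, 2] - 1.4
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# 500 trees here for speed; the original analysis used ntree = 5000, mtry = 3
rf = RandomForestClassifier(n_estimators=500, max_features=3, random_state=0)
rf.fit(X, y)

imp = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print("importance ranking (column indices, best first):", ranking)
```

An importance near zero for a column (as reported for X4) simply means permuting it barely changes the model's accuracy.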
Question 1: Does it make sense at this point to fit a binary logistic regression on only the most important predictors obtained from the Random Forest model (X1, X2, X3), without even considering the other three?
Question 2: In order to avoid the separation problem with binary logistic regression, would it make sense to get rid of X4? I am quite sure that the empty cell is an artifact of my data set. Moreover, this category as a whole represents only about 3% of the data (the contingency table is a: 140, b: 0, c: 86, d: 6).
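The separation diagnosis itself can be made concrete with a cross-tabulation: quasi-complete separation shows up as a zero cell in the X4-by-outcome table. A small Python sketch with pandas, on invented data chosen so that one level never co-occurs with one outcome (the level names and counts do not reproduce the original table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# invented data: level "b" is never observed together with outcome 1,
# which produces a zero cell in the crosstab (quasi-complete separation)
x4 = np.array(["a"] * 120 + ["b"] * 20 + ["c"] * 80 + ["d"] * 10)
y = np.concatenate([
    rng.integers(0, 2, 120),   # level a: both outcomes occur
    np.zeros(20, dtype=int),   # level b: outcome 1 never occurs
    rng.integers(0, 2, 80),    # level c: both outcomes occur
    rng.integers(0, 2, 10),    # level d: small cell, both outcomes possible
])

tab = pd.crosstab(pd.Series(x4, name="X4"), pd.Series(y, name="outcome"))
empty = tab.eq(0).any(axis=1)
print(tab)
print("levels with an empty cell:", list(tab.index[empty]))
```

A logistic regression fit on data like this would drive the coefficient for the offending level toward infinity, which is why remedies such as dropping or collapsing the level, or penalized (e.g. Firth) logistic regression, are usually discussed in this situation.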