Hi, I am trying to build a model to predict a binary outcome: screened vs. non-screened.
A little bit about the data:
1) I have about 40K records. 86% of them have the outcome "screened", so the data are very unbalanced.
2) I have about 18 predictors. Most of them have only a weak correlation with the outcome.
The goal here is to find the 3-5 most powerful predictors.
I tried two methods:
1) Regular logistic regression. As you might expect, most predictors came out as significant because of the large sample size.
    Coefficients: (1 not defined because of singularities)
                                      Estimate Std. Error z value Pr(>|z|)
    (Intercept)                      1.556e+00  2.703e-01   5.758 8.52e-09 ***
    medinc                           1.853e-06  9.007e-07   2.057  0.03969 *
    medage                          -2.309e-02  2.337e-03  -9.880  < 2e-16 ***
    raceeth_black                    1.060e+00  4.329e-01   2.449  0.01431 *
    raceeth_latino                   5.238e-01  3.767e-01   1.390  0.16444
    owner_occ                       -1.613e-02  3.068e-01  -0.053  0.95808
    renter_occ                              NA         NA      NA       NA
    publicassist                     5.416e-01  2.160e-01   2.508  0.01214 *
    nocitizen                       -1.108e+00  2.785e-01  -3.980 6.90e-05 ***
    no_health_ins                    1.431e-01  2.495e-01   0.573  0.56639
    unemployed                      -2.326e-01  5.456e-01  -0.426  0.66983
    Mail_Return_Rate_CEN_2010        1.654e-02  3.019e-03   5.479 4.27e-08 ***
    pct_URBANIZED_AREA_POP_CEN_2010  2.677e-03  5.481e-04   4.884 1.04e-06 ***
    pct_RURAL_POP_CEN_2010          -4.142e-03  6.974e-04  -5.939 2.87e-09 ***
    pct_Hispanic_CEN_2010           -6.583e-03  4.178e-03  -1.576  0.11510
    pct_NH_White_alone_CEN_2010     -2.833e-03  1.980e-03  -1.431  0.15242
    pct_NH_Blk_alone_CEN_2010       -8.922e-03  4.774e-03  -1.869  0.06165 .
    pct_Owner_Occp_HU_CEN_2010       8.470e-03  3.156e-03   2.683  0.00729 **
    samp_16                         -8.917e-01  3.428e-02 -26.016  < 2e-16 ***
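A side note on the renter_occ row: an NA coefficient together with "1 not defined because of singularities" means that column is an exact linear combination of other columns in the design matrix. Below is a minimal sketch of that idea in plain Python on made-up numbers. It assumes (a guess, not checked against the real data) that owner_occ and renter_occ are shares that sum to 1 in every row, which makes one of them redundant once an intercept is present.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination, using exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below position `rank` with a non-zero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot: this column depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical toy values: owner and renter shares that sum to 1 per record.
owner = [Fraction(7, 10), Fraction(4, 10), Fraction(9, 10), Fraction(5, 10)]
renter = [1 - o for o in owner]

# Columns: intercept, owner_occ, renter_occ (3 columns).
X_full = [[1, o, r] for o, r in zip(owner, renter)]
# Columns: intercept, owner_occ (2 columns).
X_reduced = [[1, o] for o in owner]

rank_full = matrix_rank(X_full)
rank_reduced = matrix_rank(X_reduced)
print(rank_full, rank_reduced)  # both ranks are 2
```

Three columns with rank 2 means one coefficient cannot be estimated separately. Under this assumption, dropping renter_occ loses no information.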
2) Then I tried a random forest and printed the variables ranked by importance (MeanDecreaseGini). The top five are:
       V1                         MeanDecreaseGini
    1  samp_16                            825.8772
    2  medage                             469.3604
    3  pct_NH_Blk_alone_CEN_2010          466.0336
    4  no_health_ins                      459.0699
    5  unemployed                         452.0088
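For context on what this column measures: MeanDecreaseGini adds up how much the splits on a variable reduce Gini impurity, averaged over the trees, so it is a ranking score rather than a significance test. Below is a minimal sketch of the impurity decrease for a single split, in plain Python. The node counts are hypothetical (only the 86% share in the parent node comes from the data description), and the exact scaling used inside the random forest implementation may differ.

```python
def gini(pos, neg):
    """Gini impurity of a two-class node with `pos` and `neg` cases."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)


def gini_decrease(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`.

    Each argument is a (screened, non_screened) count pair; the child
    impurities are weighted by the share of cases sent to each child.
    """
    n = sum(parent)
    w_left = sum(left) / n
    w_right = sum(right) / n
    return gini(*parent) - w_left * gini(*left) - w_right * gini(*right)


# Hypothetical node: 1000 cases, 86% screened, split into two children.
parent = (860, 140)
left = (500, 40)
right = (360, 100)

parent_gini = gini(*parent)
decrease = gini_decrease(parent, left, right)
print(round(parent_gini, 4), round(decrease, 4))
```

Even a split that separates the classes reasonably well removes only a small part of the parent impurity here, because an 86/14 node is already fairly pure to begin with.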
Both models indicate that samp_16 and medage are important or significant, which looks right to me. The logistic regression has more significant variables. However, some variables that are significant there do not appear among the top variables by importance in the random forest. And two variables, no_health_ins and unemployed, which are among the top 5 most important variables in the random forest, do not even show as significant in the logistic regression.
How should I interpret this? Is it because most variables are too weak? Should I simply pick the variables that are both significant in the logistic regression and important in the random forest?
Also, is there anything special we need to do with logistic regression when the data are unbalanced? Should I undersample or oversample first? For the random forest, I tried applying classwt, but the importance vector came out similar with and without it.
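To make the weighting idea concrete, here is a small sketch of inverse-frequency class weights for an 86/14 split, in plain Python. The total of 40,000 is a rounded assumption based on "about 40K records", and the formula is only one common heuristic, not necessarily what classwt expects or what was used in the runs above.

```python
n_total = 40_000        # "about 40K records" (rounded; an assumption)
share_screened = 0.86   # share of records with outcome "screened"

n_screened = round(n_total * share_screened)
n_non_screened = n_total - n_screened

n_classes = 2
# Inverse-frequency heuristic: weight_c = n_total / (n_classes * n_c),
# so that each class contributes the same total weight.
weights = {
    "screened": n_total / (n_classes * n_screened),
    "non_screened": n_total / (n_classes * n_non_screened),
}

for label, w in weights.items():
    print(label, round(w, 3))
```

With these weights, each non-screened record counts roughly six times as much as a screened one, and the two classes carry equal total weight.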