Hi, I am trying to build a model that predicts a binary outcome, say screened vs. non-screened.
A little bit about the data: 1) I have about 40K records, and 86% of them have the outcome "screened", so the data are quite unbalanced. 2) I have about 18 predictors, most of which have only a weak correlation with the outcome. The goal here is to find the 3-5 most powerful predictors. I tried two methods. 1) Regular logistic regression. As you might expect, most predictors came out as significant because of the large sample size.

Coefficients: (1 not defined because of singularities)
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      1.556e+00  2.703e-01   5.758 8.52e-09 ***
medinc                           1.853e-06  9.007e-07   2.057  0.03969 *  
medage                          -2.309e-02  2.337e-03  -9.880  < 2e-16 ***
raceeth_black                    1.060e+00  4.329e-01   2.449  0.01431 *  
raceeth_latino                   5.238e-01  3.767e-01   1.390  0.16444    
owner_occ                       -1.613e-02  3.068e-01  -0.053  0.95808    
renter_occ                              NA         NA      NA       NA    
publicassist                     5.416e-01  2.160e-01   2.508  0.01214 *  
nocitizen                       -1.108e+00  2.785e-01  -3.980 6.90e-05 ***
no_health_ins                    1.431e-01  2.495e-01   0.573  0.56639    
unemployed                      -2.326e-01  5.456e-01  -0.426  0.66983    
Mail_Return_Rate_CEN_2010        1.654e-02  3.019e-03   5.479 4.27e-08 ***
pct_URBANIZED_AREA_POP_CEN_2010  2.677e-03  5.481e-04   4.884 1.04e-06 ***
pct_RURAL_POP_CEN_2010          -4.142e-03  6.974e-04  -5.939 2.87e-09 ***
pct_Hispanic_CEN_2010           -6.583e-03  4.178e-03  -1.576  0.11510    
pct_NH_White_alone_CEN_2010     -2.833e-03  1.980e-03  -1.431  0.15242    
pct_NH_Blk_alone_CEN_2010       -8.922e-03  4.774e-03  -1.869  0.06165 .  
pct_Owner_Occp_HU_CEN_2010       8.470e-03  3.156e-03   2.683  0.00729 ** 
samp_16                         -8.917e-01  3.428e-02 -26.016  < 2e-16 ***
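
A minimal sketch of how the z values and p-values in this table arise, and why significance alone says little about how powerful a predictor is. It recomputes the Wald statistic (estimate divided by standard error) for three rows copied from the output above. The helper names `wald` and `odds_ratio` are made up for illustration, and the p-value uses the usual normal approximation.

```python
import math

# (estimate, std. error) pairs copied from the glm summary above
coefs = {
    "medinc": (1.853e-06, 9.007e-07),
    "medage": (-2.309e-02, 2.337e-03),
    "samp_16": (-8.917e-01, 3.428e-02),
}


def wald(estimate, se):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


def odds_ratio(estimate, delta=1.0):
    """Multiplicative change in the odds for a `delta`-unit change in the predictor."""
    return math.exp(estimate * delta)


for name, (est, se) in coefs.items():
    z, p = wald(est, se)
    print(f"{name:8s} z={z:8.3f}  p={p:.3g}  odds ratio per unit={odds_ratio(est):.4f}")

# Effect size on a more readable scale: a 10,000-unit change in medinc
print(f"medinc, +10,000 units: odds ratio = {odds_ratio(1.853e-06, 10_000):.4f}")
```

Standard errors shrink roughly with the square root of the sample size, so with about 40K records even a small effect can clear the 0.05 threshold. For instance, the medinc coefficient is significant, yet a 10,000-unit increase multiplies the odds by only about 1.019, whereas samp_16 multiplies them by roughly 0.41.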

2) Then I fit a random forest and printed the variables ranked by importance (MeanDecreaseGini). The top five are:

                      V1      MeanDecreaseGini
1                   samp_16         825.8772
2                    medage         469.3604
3 pct_NH_Blk_alone_CEN_2010         466.0336
4             no_health_ins         459.0699
5                unemployed         452.0088

Both models indicate that samp_16 and medage are important or significant, which looks right to me. The logistic regression has more significant variables. However, some variables that are significant there do not appear among the top variables by importance in the random forest. And two variables, no_health_ins and unemployed, which are among the top 5 most important variables in the random forest, do not even show as significant in the logistic regression.

How should I interpret this? Is it because most variables are too weak? Should I simply pick the variables that are both significant in the logistic regression and important in the random forest?

And is there anything special we need to do with logistic regression when the data are unbalanced? Should I do undersampling/oversampling first? For the random forest, I tried applying classwt, but the importance ranking came out similar with and without it.

kjetil b halvorsen
Wei Zeng
  • First, if you're concerned with class imbalance, you should address that issue first, using for example SMOTE or similar algorithms. Then you fit your logistic regression and RF and look for variable importance information. – horaceT Sep 14 '16 at 21:05
  • GLM and RandomForest are estimating two different things. For an example of why this is important, please see my answer here, which might be a duplicate. http://stats.stackexchange.com/questions/164048/can-random-forest-be-used-for-feature-selection-in-multiple-linear-regression/164068#164068 – Sycorax Sep 14 '16 at 22:09
  • Thanks for pointing me to that post. Very informative. So should I say that variables which show as important in RF but not significant in the logistic regression are possibly those with a non-linear relation with the outcome variable? – Wei Zeng Sep 14 '16 at 22:43
  • Logistic regression has no problems with unbalanced classes, see https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression – kjetil b halvorsen Sep 07 '17 at 11:12
  • Your classes aren't that unbalanced either. – David Ernst Sep 07 '17 at 13:28
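
On the undersampling question and the comments about class balance, here is a rough arithmetic sketch rather than a full analysis. It assumes a correctly specified logistic model and a subsample that keeps records with a probability depending only on the outcome. Under those assumptions the slope estimates are asymptotically unchanged and only the intercept shifts, by the log of the ratio of retention rates. The balancing scheme and the function names below are hypothetical.

```python
import math

p_screened = 0.86  # share of screened records reported in the question


def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1 - p))


def intercept_shift(rate_pos, rate_neg):
    """Intercept change when positives/negatives are kept with these probabilities."""
    return math.log(rate_pos / rate_neg)


# Hypothetical balancing scheme: keep every non-screened record and keep
# screened records with probability 0.14 / 0.86, giving roughly 50/50 classes.
rate_pos = (1 - p_screened) / p_screened
rate_neg = 1.0

baseline = logit(p_screened)                # log-odds of "screened" in the full data
shift = intercept_shift(rate_pos, rate_neg)

print(f"baseline log-odds:         {baseline:.4f}")
print(f"intercept shift:           {shift:.4f}")
print(f"log-odds after balancing:  {baseline + shift:.4f}")
```

In this idealized setting undersampling would mostly discard data and move the intercept, which fits the comment that logistic regression does not need balanced classes.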

0 Answers