
I have a dataset where yes = 77 and no = 16,000, i.e. a highly imbalanced dataset. My plan was to identify the most important variables influencing the response variable using a random forest, and then develop a logistic regression model using the selected variables.

I am planning to use a random forest package together with a resampling technique. I know this will reduce the type II error but increase the type I error. Is random forest a reasonable technique for analyzing this data? Is there any other machine learning technique that would be more suitable for my case?
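A minimal sketch of the two-stage plan described above, assuming scikit-learn; `X` (a pandas DataFrame of predictors) and `y` (the binary response) are illustrative names, not code from the original post:

```python
# Hypothetical sketch: rank variables with a random forest, then fit a
# logistic regression on the most important ones. X is a pandas DataFrame
# of predictors and y a 0/1 response; both names are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)

# Keep, say, the ten variables with the highest impurity importance.
top = X.columns[rf.feature_importances_.argsort()[::-1][:10]]

logit = LogisticRegression(max_iter=1000)
logit.fit(X[top], y)
```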

MSilvy

1 Answer


There are usually two methods of dealing with imbalanced data when using a random forest model: one is cost-sensitive learning and the other is sampling. For extremely imbalanced data, a random forest generally tends to be biased towards the majority class.

The cost-sensitive approach is to assign different weights to the classes. If the minority class is assigned a higher weight, and thus a higher misclassification cost, this helps reduce the model's bias towards the majority class. You can use the class_weight parameter of the random forest in scikit-learn to assign a weight to each class.
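For example, a minimal sketch; `X_train`, `y_train`, and the weight values are illustrative assumptions, not from the original post:

```python
# Cost-sensitive random forest via scikit-learn's class_weight parameter.
from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights each class inversely proportional to its frequency;
# an explicit dict such as {"no": 1, "yes": 200} works as well.
clf = RandomForestClassifier(
    n_estimators=500, class_weight="balanced", random_state=42
)
clf.fit(X_train, y_train)  # X_train / y_train assumed already loaded
```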

Secondly, there are different methods of sampling, such as oversampling the minority class or undersampling the majority class. Although simple sampling methods improve overall model performance, it is preferable to go for a more specialized sampling method such as SMOTE to get a better model.
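A minimal SMOTE sketch using the imbalanced-learn package (a separate library, not part of scikit-learn itself); variable names continue from the previous snippet and are illustrative:

```python
# Oversample the minority class with SMOTE, then train as usual.
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Resample only the training split; the test set must stay untouched,
# otherwise the evaluation will be optimistically biased.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=500, random_state=42)
clf.fit(X_resampled, y_resampled)
```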

Most machine learning models suffer from the imbalanced-data problem, although there are some reasons to believe that generative models generally tend to perform better on imbalanced datasets.

Satwik Bhattamishra
  • If you are an expert in this area, I would love some detailed answers to my questions: https://stats.stackexchange.com/questions/247871/what-is-the-root-cause-of-the-class-imbalance-problem, https://stats.stackexchange.com/questions/285231/what-problem-does-oversampling-undersampling-and-smote-solve – Matthew Drury Apr 16 '18 at 16:17
  • 2
    @Mathew Drury I've seen your questions and I never answered because I don't have an answer. However, I do have one thought. If one has the/a 'correct' set of variables, then one would expect the data space to be robust in the sense that small perturbations of the data shouldn't change the class label. Of course one has no guarantee that such is the case. One might consider checking robustness using knn. If knn showed that some nhd of a given data point was largely (mostly, entirely?) the same class label, then using smote should be effective. I believe there is a (continued) – meh Apr 16 '18 at 16:56
  • 1
    package in R that implements this. Of course this is absolutely not a theoretical explanation. I could conceive of a result that set that based on some measure of robustness as above, using smote would improve accuracy. – meh Apr 16 '18 at 16:58
  • 1
    Can you give reference where I can learn more about your generative models statement? – EngrStudent Apr 17 '18 at 16:18
  • You can refer to the book Dataset Shift in Machine Learning, which can be found at www.acad.bg/ebook/ml/The.MIT.Press.Dataset.Shift.in.Machine.Learning.Feb.2009.eBook-DDU.pdf; specifically, take a look at the lower half of page 17. I should mention that the general consensus is that discriminative models generally outperform generative models, but given a normal-sized imbalanced dataset and without using methods like SMOTE, one could go with a generative model. As the dataset grows, a discriminative model will generally outperform a generative one at some point. – Satwik Bhattamishra Apr 17 '18 at 16:50
  • 2
    I have another question: Given, I have 68 predictors: should I select the most important variable first using Gini index than build prediction model? I am not sure why, my models using no resampling and Smote giving the same result Accuracy: 0.99 and sensitivity: 0.96 – MSilvy Apr 17 '18 at 18:26
  • 1
    A few pointers: While selecting features better do it based on CV score. A simple but commonly used method is Leave-one-out CV. Check your feature importance scores based on the trained random forest model. Take the least important features and perform LOO-CV. Which basically needs you to drop each feature and perform CV to check whether removing that feature improves your score or not. You can google it if you are unclear about LOO-CV. – Satwik Bhattamishra Apr 17 '18 at 18:37
  • Secondly, but more importantly, don't use accuracy to judge your model, and it's even worse to use accuracy on an imbalanced dataset. Use the ROC AUC score, log loss, or F1 score to judge your model. You can google for more metrics; there are some metrics specific to imbalanced datasets too. – Satwik Bhattamishra Apr 17 '18 at 18:39
  • Here is my model result using 65 predictors (both factor and numeric) and an imbalanced dependent variable (2% yes vs 98% no). 1. Using unbalanced data: precision 1.000, recall 1.000, F 0.500, AUC 1.000. 2. Using under-sampling: precision 0.615, recall 1.000, F 0.381, AUC 1.000. 3. Using oversampling: precision 1.000, recall 1.000, F 0.500, AUC 1.000. 4. Using SMOTE: precision 0.941, recall 1.000, F 0.485, AUC 1.000. Did I do something terribly wrong? What does the result mean? – MSilvy Apr 26 '18 at 22:33
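A minimal sketch of the drop-one-feature cross-validation and imbalance-aware scoring suggested in the comments above; `X` (a pandas DataFrame of the predictors) and `y` are illustrative names, not code from the thread:

```python
# Score the model with ROC AUC rather than accuracy, then check whether
# dropping each of the least important features improves the CV score.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                             random_state=42)

# Baseline cross-validated ROC AUC with all features.
baseline = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Rank features by impurity importance (least important first).
clf.fit(X, y)
ranked = np.argsort(clf.feature_importances_)

for idx in ranked[:10]:  # try the ten least important features
    col = X.columns[idx]
    score = cross_val_score(clf, X.drop(columns=col), y,
                            cv=5, scoring="roc_auc").mean()
    print(f"without {col}: {score:.3f} (baseline {baseline:.3f})")
```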