I am trying gradient boosting on a dataset with an event rate of about 1% using Enterprise Miner, but it is failing to produce any output. My question is: since it is a decision-tree-based approach, is it even right to use gradient boosting with such a low event rate?
-
You are dealing with an imbalanced dataset. Boosting is indeed a good way to cope with it. For details see http://stats.stackexchange.com/questions/157940/what-balancing-method-can-i-apply-to-a-imbalanced-data-set/180316#180316 – DaL Mar 01 '16 at 07:32
-
But for me logistic regression is giving better results than random forest or gradient boosting. I wanted to improve the performance of my model by trying boosted trees. – user2542275 Mar 01 '16 at 08:57
-
Boosting is based on weak classifiers. Theoretically, any weak classifier that is slightly better than random will do. In practice, different algorithms are more suitable for some data sets, so the weak classifier you choose is important. Can you specify more about the algorithms you used, their results, and the data set? – DaL Mar 01 '16 at 09:13
-
OK. About the dataset: sample size > 4M, event rate = 1.2%. The number of predictors significant at p-value < 0.05 is 150. Logistic regression with the most significant variables gave a lift of 3 at 20% of the population. A neural network gave a lift of about 2.8. Gradient boosting did not produce any output until I used stratified sampling with inverse prior weights, but its performance is poor. – user2542275 Mar 01 '16 at 10:31
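
For readers unfamiliar with the lift figures quoted above, a minimal sketch of how lift at a given depth is computed (NumPy only; the array and function names are hypothetical):

```python
import numpy as np

def lift_at_depth(y_true, y_score, depth=0.2):
    """Lift at `depth`: event rate among the top-scored `depth` fraction
    of the population, divided by the overall event rate."""
    n_top = int(np.ceil(depth * len(y_true)))
    top = np.argsort(y_score)[::-1][:n_top]  # highest-scored observations
    return y_true[top].mean() / y_true.mean()

# toy usage: a 1.2% overall event rate, as in the dataset described above
rng = np.random.default_rng(0)
y = (rng.random(100_000) < 0.012).astype(int)
score = y * rng.random(100_000) + 0.5 * rng.random(100_000)  # crude score
print(lift_at_depth(y, score, depth=0.2))
```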
-
Since your data set is quite big, you should have enough samples of your minority class, so the problem is due to the relative imbalance. You have quite a few features but not too many, though decision trees are indeed less suitable for such datasets. I suggest that you create a balanced dataset and see how well your algorithms perform on it. Then you'll be able to apply the algorithm to the original dataset the way I described in the first comment. – DaL Mar 01 '16 at 11:32
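
A minimal sketch of the balanced-dataset experiment DaL suggests, assuming a pandas DataFrame `df` with a binary `target` column (both names hypothetical):

```python
import pandas as pd

def balanced_downsample(df, target="target", seed=42):
    """Downsample the majority class to match the minority class size."""
    minority = df[df[target] == 1]
    majority = df[df[target] == 0].sample(n=len(minority), random_state=seed)
    # concatenate and shuffle the balanced sample
    return pd.concat([minority, majority]).sample(frac=1, random_state=seed)

# train on the balanced sample, then evaluate/apply on the original,
# imbalanced data as described in the linked CV.SE answer
```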
-
Yes, I have done stratified sampling, but nothing is beating logistic regression as of now. – user2542275 Mar 01 '16 at 13:08
-
Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/36407/discussion-between-dan-levin-and-user2542275). – DaL Mar 01 '16 at 13:24
-
This is an old post, but I thought this might be of value: https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/ It's a blog post showing that when there is a large number of categorical variables encoded with one-hot encoding, random forests (not sure whether this still applies to GBMs) struggle to extract the valuable information, whereas logistic regression does a much better job. – ponadto Dec 27 '16 at 21:37
-
Logistic regression does often tend to do better than tree-based models with very rare events. http://www.win-vector.com/blog/2015/02/does-balancing-classes-improve-classifier-performance/ – captain_ahab Dec 20 '17 at 18:25
1 Answer
(To give a short answer to this:)
It is fine to use a gradient boosting machine algorithm when dealing with an imbalanced dataset. When dealing with a strongly imbalanced dataset, it is much more relevant to question the suitability of the metric used. We should potentially avoid metrics, like Accuracy or Recall, that are based on arbitrary thresholds, and opt for metrics, like AUCPR or the Brier score, that give a more accurate picture (see the excellent CV.SE thread Why is accuracy not the best measure for assessing classification models? for more). Similarly, we could potentially employ a cost-sensitive approach by assigning different misclassification costs (e.g. see Masnadi-Shirazi & Vasconcelos (2011) Cost-Sensitive Boosting for a general view and proposed changes to known boosting algorithms, or, for a particularly interesting application with a simpler approach, check the Higgs Boson challenge report for the XGBoost algorithm; Chen & He (2015) Higgs Boson Discovery with Boosted Trees provide more details).
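As an illustration of both points, here is a minimal sketch using scikit-learn's threshold-free metrics alongside XGBoost's `scale_pos_weight` parameter for simple cost-sensitive weighting (the synthetic dataset and variable names are hypothetical, chosen only to loosely mirror the ~1% event rate in the question):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, brier_score_loss
from sklearn.model_selection import train_test_split

# a synthetic problem with roughly a 1% event rate
X, y = make_classification(n_samples=200_000, n_features=50,
                           weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# scale_pos_weight up-weights the positive class; the neg/pos ratio is a
# common starting point (a tuning choice, not a universal rule)
clf = xgb.XGBClassifier(scale_pos_weight=(y_tr == 0).sum() / (y_tr == 1).sum())
clf.fit(X_tr, y_tr)
p = clf.predict_proba(X_te)[:, 1]

# threshold-free metrics: AUCPR (average precision) and Brier score
print("AUCPR:", average_precision_score(y_te, p))
print("Brier:", brier_score_loss(y_te, p))
```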
It is also worth noting that if we employ a probabilistic classifier (like GBMs), we can/should actively look into calibrating the returned probabilities (e.g. see Zadrozny & Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates or Kull et al. (2017) Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers) to potentially improve our learner's performance. Especially when working with imbalanced data, adequately capturing tendency changes might be more informative than simply labelling the data. To that extent, some might argue that cost-sensitive approaches are not that beneficial in the end (e.g. see Nikolaou et al. (2016) Cost-sensitive boosting algorithms: Do we really need them?). To reiterate the original point though: boosting algorithms are not inherently bad for imbalanced data, and in certain cases they can offer a very competitive option.
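A minimal sketch of post-hoc calibration with scikit-learn (Platt scaling / isotonic regression; beta calibration as in Kull et al. lives in separate packages and is not shown here), continuing from the hypothetical `clf`, `X_tr`/`y_tr`, and `X_te`/`y_te` in the sketch above:

```python
from sklearn.calibration import CalibratedClassifierCV

# wrap the boosted model and refit with cross-validated calibration;
# "isotonic" is non-parametric and needs plenty of data, while "sigmoid"
# (Platt scaling) is the safer choice on smaller samples
calibrated = CalibratedClassifierCV(clf, method="isotonic", cv=3)
calibrated.fit(X_tr, y_tr)
p_cal = calibrated.predict_proba(X_te)[:, 1]

# a proper scoring rule like the Brier score will reflect any improvement
print("Brier after calibration:", brier_score_loss(y_te, p_cal))
```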

-
I believe the Brier score is equivalent to the Accuracy measure, so it will have the same limitations as Accuracy when assessing rare-event models. – RobertF Feb 05 '20 at 15:23
-
Brier score is not equivalent to Accuracy. Please note that we use the predicted probability for the calculation of the Brier score, while for the Accuracy calculation we use labels based on hard thresholding of the predicted probabilities. – usεr11852 Feb 05 '20 at 16:17
-
Thanks for clarifying - using the estimated probability rather than 0/1 for the predicted class makes more sense. – RobertF Feb 07 '20 at 15:38
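
A toy calculation making the distinction in this exchange concrete (the numbers are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score, brier_score_loss

y_true = np.array([0, 0, 0, 1])
p_a = np.array([0.10, 0.10, 0.10, 0.90])  # sharp, well-calibrated scores
p_b = np.array([0.45, 0.45, 0.45, 0.55])  # barely over the 0.5 threshold

# accuracy is identical because both sets of scores cross 0.5 the same way...
print(accuracy_score(y_true, p_a > 0.5), accuracy_score(y_true, p_b > 0.5))
# ...but the Brier score rewards the sharper, better-calibrated probabilities
print(brier_score_loss(y_true, p_a), brier_score_loss(y_true, p_b))
```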