
I'm using generalized boosted regression models to explore how much each of 20 independent environmental variables (x1, x2, ..., x20) contributes to explaining the variability of the dependent environmental variable "y".

The independent variables are a mix of binomial and continuous variables, and the dependent variable is continuous.

Samples were taken in different regions, sites and seasons, but some regions, sites and seasons have more data than others. Data would look like this:

  • region A: 15 sites, 309 samples; site A1: spring 12, summer 10, autumn 3, winter 1
  • region B: 26 sites, 662 samples; site B1: spring 9, summer 29, autumn 5, winter 5

I'm using the gbm() function in R.

I have been told that this method is not appropriate for unbalanced datasets, as the results would be biased toward the group with more samples. I have therefore been asked to account for this issue in the model, but I haven't found how to do it.

My questions are:

  1. Are generalized boosted regression models, and in particular the gbm function, suitable for unbalanced datasets? If so, does anybody know of a reference to support this?

  2. If these models don't work for unbalanced datasets, does anybody have a suggestion for how to approach this problem?
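For reference, one option that might address question 2 is observation weighting, since gbm() accepts a `weights` argument. The sketch below is only an illustration under stated assumptions: the region sizes (309 and 662) come from the numbers above, but `y`, `x1` and `x2` are simulated placeholders, and inverse-frequency weighting by region is just one possible scheme, not something the reviewer or the gbm documentation prescribes.

```r
# Hypothetical toy data: region sizes are taken from the question (309 and 662);
# y, x1 and x2 are simulated placeholders, not real measurements.
set.seed(1)
region <- factor(rep(c("A", "B"), times = c(309, 662)))
n <- length(region)
dat <- data.frame(
  x1 = rnorm(n),
  x2 = rbinom(n, 1, 0.5),
  region = region
)
dat$y <- 2 * dat$x1 + dat$x2 + rnorm(n)

# Inverse-frequency weights: each sample gets 1 / (number of samples in its region),
# so both regions contribute the same total weight to the loss.
region_n <- table(dat$region)
w <- as.numeric(1 / region_n[as.character(dat$region)])
w <- w * n / sum(w)  # rescale to mean 1 for readability

# Total weight per region is now equal
print(tapply(w, dat$region, sum))

# Fit only if the gbm package is installed
if (requireNamespace("gbm", quietly = TRUE)) {
  fit <- gbm::gbm(
    y ~ x1 + x2,
    data = dat,
    weights = w,
    distribution = "gaussian",
    n.trees = 500,
    interaction.depth = 2,
    shrinkage = 0.05
  )
  # Relative influence of each predictor under the weighted fit
  print(summary(fit, plotit = FALSE))
}
```

The same idea could be applied per site or per site-season cell instead of per region. Comparing the relative influences from the weighted and unweighted fits would show whether the imbalance matters for the conclusions at all.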

Many thanks

LMC
  • It's not true that GBM is any more problematic for unbalanced data than any other model or algorithm. In general, useful machine learning models estimate probabilities, and these probabilities will naturally reflect the balance in your dataset. To give any more advice though, you will need to add details about your project, and what you hope to accomplish. You have described your data, but not what you hope to accomplish with it, what problem you are trying to solve. This kind of information is important if you wish to get some advice. – Matthew Drury Jun 10 '18 at 21:19
  • Thanks Matthew for your comments. I have edited my question with more information, I hope this will make it more clear. – LMC Jun 10 '18 at 22:37
  • I'm not sure I have completely understood your answer. Do you mean that it's OK to use GBM with unbalanced datasets? If so, I would have to provide an explanation to the referee, so if you know of somewhere I can read more about it, that would be very useful. Many thanks – LMC Jun 10 '18 at 22:51
  • This is mostly based on years of experience using boosting to fit predictive models in insurance, where all data sets are always imbalanced. I'm not sure of a source for this kind of thing, but there are a few threads on this site about imbalanced datasets in ML in general; here's the most popular: https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning/284074?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa Did the reviewer provide a source for their claim that boosting has general difficulties with imbalance? – Matthew Drury Jun 13 '18 at 18:23
  • One suggestion, which is unfortunately not a reference, is to simply demonstrate that your model is not suffering from these issues. You may want to consider providing the reviewer with a model calibration plot: http://scikit-learn.org/stable/auto_examples/calibration/plot_calibration_curve.html – Matthew Drury Jun 13 '18 at 18:26
  • Hi @MatthewDrury, thank you very much for the links. The reviewer did not give any specific reason. He/she just pointed out that my database was unbalanced, asked how I accounted for that in my model and, if I didn't, asked me to do it. Looking at my dataset in detail, there are only a few sites in one region that are quite unbalanced, so removing those might be a potential solution. I will keep looking for references about this type of model and how it deals with unbalanced databases before deciding. Thanks – LMC Jun 16 '18 at 19:21
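
The suggestion in the comments to demonstrate that the model does not suffer from the imbalance could be sketched for a continuous response roughly as follows. This is a minimal illustration, not a standard diagnostic from the gbm package: `error_by_group` is a hypothetical helper, and the numbers are made up. The idea is that if the fit were biased toward the larger group, the smaller group would tend to show a mean error far from zero or a clearly larger RMSE on held-out data.

```r
# Hypothetical helper: summarise prediction error within each group
# (for example region, site or season).
error_by_group <- function(observed, predicted, group) {
  res <- observed - predicted
  g <- factor(group)
  data.frame(
    group = levels(g),
    n = as.vector(table(g)),
    mean_error = as.vector(tapply(res, g, mean)),
    rmse = as.vector(tapply(res, g, function(r) sqrt(mean(r^2))))
  )
}

# Toy example with made-up observed and predicted values
obs  <- c(1, 2, 3, 10, 12)
pred <- c(1.5, 2, 2.5, 9, 11)
grp  <- c("A", "A", "A", "B", "B")
out <- error_by_group(obs, pred, grp)
print(out)
```

In practice `predicted` would come from held-out or cross-validated predictions of the fitted model rather than from the training data.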

0 Answers