I'm using generalized boosted regression models to explore how much each of 20 independent environmental variables (x1, x2, ..., x20) contributes to explaining the variability of the dependent environmental variable "y".
The independent variables are a mix of binomial and continuous variables, and the dependent variable is continuous.
Samples were taken in different regions, sites and seasons, but some regions, sites and seasons have more data than others. The data look like this:
- region A: 15 sites, 309 samples; site A1: spring 12, summer 10, autumn 3, winter 1
- region B: 26 sites, 662 samples; site B1: spring 9, summer 29, autumn 5, winter 5
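To make the imbalance concrete, here is a minimal sketch in base R that uses only the counts for the two example sites above. The column names and the weight column `w` are my own illustration, not part of my real analysis. I assume a per-sample vector built this way could be passed to the `weights` argument of gbm(), but I don't know whether that is an appropriate way to handle the problem.

```r
# Sample counts per season for the two example sites quoted above.
counts <- data.frame(
  region = rep(c("A", "B"), each = 4),
  site   = rep(c("A1", "B1"), each = 4),
  season = rep(c("spring", "summer", "autumn", "winter"), times = 2),
  n      = c(12, 10, 3, 1, 9, 29, 5, 5)
)

# Total samples per site: shows how unevenly the sites are sampled.
site_totals <- tapply(counts$n, counts$site, sum)
print(site_totals)

# Hypothetical inverse-frequency weight for each site/season cell:
# samples from sparsely sampled cells get a larger weight.
counts$w <- 1 / counts$n

# Rescale so the weights average 1 over all samples
# (the weighted sample total equals the raw sample total).
counts$w <- counts$w * sum(counts$n) / sum(counts$w * counts$n)
print(counts)
```

With weights like these, every site/season cell would contribute equally in total, which may or may not be what is wanted for cells with a single sample (such as winter at site A1).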
I'm using the function gbm() in R.
I have been told that this method is not appropriate for unbalanced datasets, because the results would be biased toward the groups with more samples. I have therefore been asked to account for this in the model, but I haven't found out how to do it.
My questions are:
Are generalized boosted regression models, and in particular the gbm() function, suitable for unbalanced datasets? If so, does anybody know of a reference to support this?
If these models don't work for unbalanced datasets, does anybody have a suggestion for how to approach this problem?
Many thanks