
I have a regression problem that I solve with a random forest in R. The inputs are several predictors (up to 40); the output is a point in time, represented as a year between 1867 and 2017. The reference data is severely unbalanced: >95% of the samples lie between 2000 and 2017 and <1% between 1867 and 1950.

For classification problems I found several answers (here is a good overview) on handling imbalanced data, but none for regression. I already tried treating the problem as classification (years can be interpreted as discrete, so factorizing the limited number of years was no issue), but this severely degraded performance. I'm not sure how exactly random forests treat the continuity (likely by averaging the tree predictions?), but the "continuous context" of the regression seems to help, so for now I'll stick with the regression.
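For context, this is roughly what the two setups look like with the randomForest package (train_data is just a placeholder for my training split): the target stays numeric for regression, where the forest averages the tree predictions, and gets factorized for the classification attempt, where the trees vote.

library(randomForest)

# regression: year stays numeric, prediction = average over the trees
rf_reg = randomForest(TARGET_VAR ~ ., data = train_data)

# classification attempt: factorize the year, prediction = majority vote over the trees
train_cls = train_data
train_cls$TARGET_VAR = factor(train_cls$TARGET_VAR)
rf_cls = randomForest(TARGET_VAR ~ ., data = train_cls)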

I applied stratification to the training and test set, making sure that at least some of the few early examples end up in each set.

library(splitstackshape)
# sample N_SAMPLES rows per value of TARGET_VAR; bothSets = TRUE also returns the non-sampled rows
strat = stratified(data, "TARGET_VAR", N_SAMPLES, bothSets = TRUE)
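With bothSets = TRUE the call returns a list of two tables; as far as I understand the splitstackshape docs, the first element is the drawn sample and the second the remainder, which I then use as training and test set:

train_set = strat[[1]]  # the stratified sample (N_SAMPLES per year)
test_set  = strat[[2]]  # everything not sampled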

I also played around with oversampling (which seems to lead to severe overfitting) and undersampling (which basically throws away a large part of my tediously collected training data). However, results remain modest.

library(UBL)
# do the training/test split before resampling (especially important with oversampling!)
data_undersampled = RandUnderRegress(TARGET_VAR ~ ., data)  # randomly drops rows from the over-represented range
data_oversampled = RandOverRegress(TARGET_VAR ~ ., data)    # randomly replicates rows from the under-represented range
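To make the comment above concrete, here is a sketch of how I apply it, reusing train_set/test_set from the stratified split above: only the training split is resampled, the test split stays untouched. UBL's C.perc argument can also be set explicitly instead of the default "balance" if I want less aggressive resampling.

library(UBL)
library(randomForest)

# resample only the training split; coerce to a plain data.frame to be safe
train_over = RandOverRegress(TARGET_VAR ~ ., as.data.frame(train_set))

# fit on the resampled training data, evaluate on the untouched test split
rf = randomForest(TARGET_VAR ~ ., data = train_over)
pred = predict(rf, newdata = test_set)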

Are there other/better techniques to handle random forest regression with unbalanced continuous data?

Honeybear
    Why do you have imbalanced data? Does the imbalance reflect the reality of the situation? // [Class imbalance in "classification" problems is not a problem, and over/under-sampling will not solve a non-problem.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) – Dave Jul 28 '21 at 19:36
  • Thanks for the great link! Even though the imbalance partially reflects reality, I think it is still a problem. We are mapping unused forests for ecosystem mapping, specifically the year of their last usage (forests tend to be used, so the year tends to be closer to the present), but the imbalance also stems from the fact that it is much easier to get reference data for currently managed forests. The model generally overestimates the year (prediction > real year of usage); it is biased by the underrepresentation of early years (<1950), significantly lowering their recall (26% as a class vs. 94% for >1997). – Honeybear Jul 29 '21 at 06:47

0 Answers