I have a regression problem that I am solving with a random forest in R. The inputs are up to 40 predictors; the output is a point in time, represented as a year between 1867 and 2017. The reference data is severely imbalanced: over 95% of the targets lie between 2000 and 2017, and less than 1% between 1867 and 1950.
For classification problems I found several answers on handling imbalanced data (here is a good overview), but none for regression. I already tried treating the problem as classification (years can be interpreted as discrete values, so factorizing the limited number of years was no issue), but this severely degraded performance. I am not sure exactly how random forests handle the continuity (presumably by averaging the targets in each terminal node?), but the "continuous context" of the regression seems to help, so for now I will stick with the regression.
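To illustrate the two framings I compared, here is a self-contained toy sketch (the data shape mimics my real problem but is made up; I use the randomForest package, which fits a regression forest for a numeric response and a classification forest for a factor response):

```r
library(randomForest)

# toy data with an imbalanced year distribution, mimicking the real problem
set.seed(42)
year <- c(sample(2000:2017, 950, replace = TRUE),              # majority: recent years
          sample(c(1867, 1900, 1925, 1950), 50, replace = TRUE))  # rare old years
dat <- data.frame(x1 = year + rnorm(1000, sd = 5), x2 = rnorm(1000))

# regression framing: numeric response; each tree predicts the mean of its
# terminal node, and the forest averages over trees
rf_reg <- randomForest(dat, y = year)

# classification framing: years as unordered factor levels, majority vote
rf_cls <- randomForest(dat, y = factor(year))
```

In the regression framing, predictions can fall between observed years, which is part of the "continuous context" I mean; the classification framing can only ever predict one of the observed levels.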
I applied stratification to the training and test sets, making sure that at least some of the few old examples end up in each set.
library(splitstackshape)
# stratified split on the target; N_SAMPLES is my per-stratum sample size
strat = stratified(data, "TARGET_VAR", N_SAMPLES, bothSets = TRUE)
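For reference, extracting the two partitions from the result looks like this (a self-contained toy sketch; the element names SAMP1/SAMP2 are what splitstackshape returns with bothSets = TRUE, if I read the docs correctly):

```r
library(splitstackshape)

# toy stand-in for my real data; TARGET_VAR mimics the imbalanced year target
toy <- data.frame(TARGET_VAR = rep(c(1900, 2010), times = c(5, 95)),
                  x = rnorm(100))

# a size below 1 draws a proportionate sample per stratum (here 20% of each year)
parts <- stratified(toy, "TARGET_VAR", size = 0.2, bothSets = TRUE)

# with bothSets = TRUE the result is a list: SAMP1 = sampled rows, SAMP2 = rest
test_set  <- parts$SAMP1
train_set <- parts$SAMP2
```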
I also played around with oversampling (which seems to lead to severe overfitting) and undersampling (which basically throws away a large part of my tediously collected training data). However, the results remain modest.
library(UBL)
# do the training-test split before resampling (especially in the case of oversampling!)
data_undersampled = RandUnderRegress(TARGET_VAR~., data)
data_oversampled = RandOverRegress(TARGET_VAR~., data)
Are there other or better techniques for handling random forest regression with imbalanced continuous data?