The following example is borrowed from the forecastxgb author's blog. Tree-based models can't extrapolate by their nature, but there must be some way to combine the benefits of a tree model (capturing interactions) with a linear model's ability to extrapolate a trend. Could anyone suggest some ideas?
In some Kaggle solutions I have seen, people suggest using the linear model's prediction as an extra feature for the tree model, which can improve the predictions, but how does that improve extrapolation? (A rough sketch of this idea is at the end of the post.)
Another idea is to use xgboost to predict the residuals of the linear model, which can also help the predictions a lot (also sketched at the end).
Is there any way to do this?
library(xgboost) # extreme gradient boosting
set.seed(134) # for reproducibility
x <- 1:100 + rnorm(100)
y <- 3 + 0.3 * x + rnorm(100)
extrap <- data.frame(x = 101:120 + rnorm(20))
xg_params <- list(objective = "reg:linear", max.depth = 2)
mod_cv <- xgb.cv(label = y, params = xg_params, data = as.matrix(x), nrounds = 40, nfold = 10)
# choose the nrounds that gives the best cross-validated RMSE
best_nrounds <- which.min(mod_cv$evaluation_log$test_rmse_mean)
mod_xg <- xgboost(label = y, params = xg_params, data = as.matrix(x), nrounds = best_nrounds)
p <- function(title){
  plot(x, y, xlim = c(0, 150), ylim = c(0, 50), pch = 19, cex = 0.6,
       main = title, xlab = "", ylab = "", font.main = 1)
  grid()
}
predshape <- 1
p("Extreme gradient boosting")
points(extrap$x, predict(mod_xg, newdata = as.matrix(extrap)), col = "darkgreen", pch = predshape)
mod_lm <- lm(y ~ x)
p("Linear regression")
points(extrap$x, predict(mod_lm, newdata = extrap), col = "red", pch = predshape)
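To make the two ideas above concrete, here are rough sketches of what I mean, continuing from the objects defined above. The names (train_feats, new_feats, mod_xg2, resid_y, mod_resid) are just made up for illustration, and I haven't verified that either approach actually fixes the extrapolation problem.

# Idea 1: add the linear model's prediction as an extra feature for xgboost
train_feats <- cbind(x = x, lm_pred = predict(mod_lm))
new_feats   <- cbind(x = extrap$x, lm_pred = predict(mod_lm, newdata = extrap))
mod_xg2 <- xgboost(data = train_feats, label = y,
                   params = xg_params, nrounds = best_nrounds, verbose = 0)
p("xgboost with lm prediction as a feature")
points(extrap$x, predict(mod_xg2, newdata = new_feats),
       col = "blue", pch = predshape)

# Idea 2: fit xgboost to the residuals of the linear model,
# then add the linear trend and the residual prediction together
resid_y   <- y - predict(mod_lm)
mod_resid <- xgboost(data = as.matrix(x), label = resid_y,
                     params = xg_params, nrounds = best_nrounds, verbose = 0)
p("Linear trend + xgboost on residuals")
points(extrap$x,
       predict(mod_lm, newdata = extrap) +
         predict(mod_resid, newdata = as.matrix(extrap$x)),
       col = "purple", pch = predshape)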