
The extreme gradient boosting algorithm (XGBoost) seems to be widely applied these days. I often have the feeling that boosted models tend to overfit. I know that there are parameters in the algorithm to prevent this. Sticking to the documentation here, the parameters subsample and colsample_bytree could (among others) prevent overfitting. But they do not serve the same purpose as bagging xgboosted models would, right?

My question: would you apply bagging on top of xgboost to reduce the variance of the fit?

So far the question is statistical, and I dare to add a code detail: in case bagging makes sense, I would be happy to see example code using the R package caret.

EDIT after the remark: if we rely on the parameters only to control overfitting, then how can we best design the cross-validation? I have approximately 6000 data points and apply 5-fold cross-validation. What could improve the out-of-sample performance: going to something like 10-fold cross-validation, or doing repeated 5-fold cross-validation? Just to mention: I use the package caret, where such strategies are implemented.
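
For concreteness, here is a minimal sketch of what repeated 5-fold cross-validation for xgboost could look like in caret; the data frame `my_data` and the outcome `y` are placeholders, not taken from the question:

```r
# Minimal sketch: repeated 5-fold CV for xgboost via caret.
# `my_data` and the outcome `y` are placeholder names.
library(caret)

set.seed(42)

ctrl <- trainControl(
  method  = "repeatedcv",   # repeated k-fold cross-validation
  number  = 5,              # 5 folds
  repeats = 5               # repeat the 5-fold split 5 times
)

fit <- train(
  y ~ .,
  data       = my_data,     # ~6000 rows in the question's setting
  method     = "xgbTree",   # caret's interface to xgboost
  trControl  = ctrl,
  tuneLength = 3            # small illustrative tuning grid
)

fit$results                 # resampled performance for each candidate parameter set
```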

Richi W
  • Just a comment. You didn't mention the learning rate of boosted models explicitly, which is extremely important in preventing overfitting. – Matthew Drury Mar 25 '16 at 16:29
  • Could work, but ensembles of ensembles can grow quite big. It may be more efficient to find an appropriate set of training parameters that does not lead to overfitting for a given data set. – Soren Havelund Welling Mar 27 '16 at 13:43
  • @SorenHavelundWelling please see my edit. – Richi W Mar 29 '16 at 06:47
  • http://link.springer.com/article/10.1186/1758-2946-6-10 If I were to publish some A-grade ML model I would go for the proposed **Algorithm 3: repeated grid-search cross-validation for variable selection and parameter tuning**. I don't use `caret` that much (I should). As I remember, `caret` does not provide an outer cross-validation for a grid search. I would feel comfortable wrapping a `caret` grid search in an outer 5- or 10-fold CV loop and checking whether each fold's optimal parameters come out close to the same. For the final model, pick the typical parameter set from the folds and use the outer CV as the error estimate (a rough sketch of this idea follows the comments). – Soren Havelund Welling Mar 29 '16 at 20:42
  • @SorenHavelundWelling I opened up a discussion about overfitting here: http://stats.stackexchange.com/questions/204489/discussion-about-overfit-in-xgboost in case you want to join. – Richi W Mar 30 '16 at 08:03
  • There was an article on this in JMLR by Tuv, Borisov, Runger and Torkkola: https://www.jmlr.org/papers/volume10/tuv09a/tuv09a.pdf – EngrStudent Nov 02 '20 at 00:31
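
A rough sketch of the nested cross-validation idea from the comment above, again assuming a placeholder data frame `my_data` with outcome `y` (this is an illustration of the general procedure, not code from the linked paper):

```r
# Sketch only: wrap caret's inner grid search in an outer 5-fold CV loop,
# then compare the parameters selected in each outer fold.
library(caret)

set.seed(1)
outer_folds <- createFolds(my_data$y, k = 5)           # held-out indices per outer fold

inner_ctrl <- trainControl(method = "cv", number = 5)  # inner grid-search CV

outer_results <- lapply(outer_folds, function(test_idx) {
  train_set <- my_data[-test_idx, ]
  test_set  <- my_data[test_idx, ]

  fit <- train(y ~ ., data = train_set,
               method = "xgbTree",
               trControl = inner_ctrl,
               tuneLength = 3)

  list(best_params = fit$bestTune,                                 # tuned parameters in this fold
       outer_rmse  = RMSE(predict(fit, test_set), test_set$y))     # outer error estimate
})

# Do the selected parameters agree across the outer folds?
do.call(rbind, lapply(outer_results, `[[`, "best_params"))
```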

1 Answer


The bag in bagging is about aggregation. If you have k CART models, then for a given input you get k candidate answers. How do you reduce those to a single value? The aggregation does that. It is often a measure of central tendency like the mean or the mode.
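
To make the aggregation step concrete, here is a toy illustration with `rpart` (a synthetic example, not xgboost):

```r
# Toy bagging: k bootstrapped CART fits, then the mean as the aggregation.
library(rpart)

set.seed(7)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

k <- 25
preds <- sapply(seq_len(k), function(i) {
  boot <- d[sample(n, replace = TRUE), ]   # bootstrap sample
  tree <- rpart(y ~ x, data = boot)        # one CART model
  predict(tree, newdata = d)               # one of the k candidate answers
})

bagged <- rowMeans(preds)                  # the aggregation: mean over the k answers
```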

In order to aggregate, you need multiple outputs. The gradient boosted machine (GBM), as in XGBoost, is a series ensemble, not a parallel one. This means that it lines the learners up in a bucket brigade: each learner (except the first and the last) takes the output of the previous one and hands its own output to the next. The final output has the same structure as that of a single CART model: one value. There is no bootstrap aggregation to be done on a single element.
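
A hand-rolled sketch of that serial structure (a simplified squared-error boosting loop on toy data, not the actual xgboost algorithm):

```r
# Each weak learner fits what the previous stages left unexplained;
# the chain returns one accumulated prediction, not k outputs to aggregate.
library(rpart)

set.seed(7)
n <- 200
d <- data.frame(x = runif(n))
d$y <- sin(2 * pi * d$x) + rnorm(n, sd = 0.3)

eta  <- 0.1                     # learning rate (slow learning)
pred <- rep(mean(d$y), n)       # stage 0: constant prediction

for (m in 1:100) {
  d$resid <- d$y - pred                                   # residual handed to this stage
  stump   <- rpart(resid ~ x, data = d,
                   control = rpart.control(maxdepth = 1)) # weak learner (a stump)
  pred    <- pred + eta * predict(stump, newdata = d)     # pass the result to the next stage
}

# 'pred' is the single output of the whole chain
```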

EngrStudent
  • Nice to see a good answer to a long-unanswered question (+1). The serial nature of gradient boosted machines is why slow learning at each step is so important to avoid the overfitting noted by the OP. – EdM Jan 04 '21 at 20:07
  • Slow learning does several good things. There isn't momentum, so the best resolving power of the learner is governed by the single learning rate. – EngrStudent Jan 25 '21 at 12:44