9

For a classification problem (assume that the loss function is the negative binomial log-likelihood), the gradient boosting (GBM) algorithm computes the residuals (negative gradient) and then fits them using a regression tree with mean squared error (MSE) as the splitting criterion. How is that different from the XGBoost algorithm?

Does XGBoost utilize regression trees to fit the negative gradient? Is the only difference between GBM and XGBoost the regularization terms or does XGBoost use another split criterion to determine the regions of the regression tree?

Marjolein Fokkema
gnikol
  • XGBoost is a particular implementation of GBM that has a few extensions to the core algorithm (as do many other implementations) that seem in many cases to improve performance slightly. You can specify your own loss function or use one of the off-the-shelf ones. – jbowman Mar 01 '18 at 14:29
  • I have slightly modified my question. My main question is whether XGBoost utilizes regression trees to fit the negative gradient, with MSE as the split criterion. – gnikol Mar 01 '18 at 15:00
  • You don't have to use MSE, but you can. – jbowman Mar 01 '18 at 15:46

3 Answers

6

@jbowman has the right answer: XGBoost is a particular implementation of GBM.

GBM is an algorithm and you can find the details in Greedy Function Approximation: A Gradient Boosting Machine.

XGBoost is an implementation of GBM; you can configure which base learner to use. It can be a tree, a stump, or another model, even a linear model.

Here is an example of using a linear model as the base learner in XGBoost:

How does linear base learner works in boosting? And how does it works in the xgboost library?
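For instance, here is a minimal sketch (assuming the Python xgboost package and made-up toy data) that swaps the default tree base learner for a linear one:

```python
import numpy as np
import xgboost as xgb

# Toy binary-classification data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y)

# booster="gblinear" uses a linear base learner; the default,
# booster="gbtree", boosts regression trees instead.
params = {
    "booster": "gblinear",
    "objective": "binary:logistic",
    "eta": 0.3,
}
model = xgb.train(params, dtrain, num_boost_round=50)
```

With booster="gbtree" you would get the boosted regression trees that the question asks about.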

Haitao Du
  • Thank you for your answer but I still do not get it. I have read the paper you cite, and in step 4 of Algorithm 1 it uses the square loss to fit the negative gradient, and in step 5 it uses the loss function to find the optimal step. I have also read "Higgs Boson Discovery with Boosted Trees", which explains XGBoost, and if I understand it correctly, in order to determine the best split it uses the loss function which needs to be optimized and computes the loss reduction. It doesn't say anything about the square loss. – gnikol Mar 01 '18 at 16:06
  • @gnikol then what's your question? Are you not connecting the GBM paper with the xgboost implementation? Have you read this one? http://xgboost.readthedocs.io/en/latest/model.html – Haitao Du Mar 01 '18 at 16:07
  • I was trying to understand how the two algorithms are connected. At first I thought that the only difference was the regularization terms. But I got lost regarding how XGBoost determines the tree structure. I know that GBM uses a regression tree to fit the residual. And my question was whether XGBoost uses the same process but adds a regularization component. – gnikol Mar 01 '18 at 16:13
  • @gnikol If I remember correctly, XGBoost also uses a regression tree to fit, even if it is a classification problem. – Haitao Du Mar 01 '18 at 17:06
  • Yes, it uses regression trees, but I think with a different split criterion. I also believe that if your overall objective is to minimize a specific loss function, it utilizes the reduction in this loss function in the regression tree to find the tree structure/best split. But I am not sure. That is my question. – gnikol Mar 01 '18 at 17:24
  • @gnikol if you want to know the details, why not check the source code of xgboost? – Haitao Du Mar 01 '18 at 17:38
  • I thought to install the package in Python and run the code line by line, but I have problems installing it. Maybe I will check it in R. Thanks. – gnikol Mar 01 '18 at 17:40
  • @gnikol Do not do that, since you would be installing a binary/compiled version; that is why it runs fast. What you need to do is check their GitHub source code. – Haitao Du Mar 01 '18 at 17:42
  • @gnikol here https://github.com/dmlc/xgboost/tree/master/src/tree – Haitao Du Mar 01 '18 at 17:43
1

the gradient boosting (GBM) algorithm computes the residuals (negative gradient) and then fits them using a regression tree with mean squared error (MSE) as the splitting criterion. How is that different from the XGBoost algorithm?

Both indeed fit a regression tree to minimize MSE w.r.t. a pseudo-response variable in every boosting iteration. This pseudo response is computed based on the original binary response, and predictions from the regression trees of previous iterations.

The two methods differ in how the pseudo response is computed. GBM uses a first-order derivative of the loss function at the current boosting iteration, while XGBoost uses both the first- and second-order derivatives. The latter is also known as Newton boosting.
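As a concrete sketch of that difference (my own notation, not taken from the thread): with binary labels $y_i \in \{0, 1\}$, the log loss, and $p_i$ the probability predicted by the trees of the previous iterations, the derivatives with respect to the raw score are

$$ g_i = p_i - y_i, \qquad h_i = p_i (1 - p_i). $$

GBM fits the next regression tree to the pseudo-response $-g_i = y_i - p_i$ by ordinary least squares, whereas Newton boosting fits a weighted least-squares problem with working response $-g_i / h_i$ and weights $h_i$; up to the regularization terms, this is what XGBoost's split criterion optimizes.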

As far as I know, the R package gbm uses gradient boosting by default. At each boosting iteration, the regression tree minimizes the least-squares approximation to the negative gradient.

Is the only difference between GBM and XGBoost the regularization terms or does XGBoost use another split criterion to determine the regions of the regression tree?

I do not fully understand what you mean by the regularization term. The split criterion for the regression tree indeed differs, because XGBoost computes the pseudo-response variable differently. At each boosting iteration, it fits a regression tree to minimize a weighted least squares approximation. The pseudo response is calculated so that the more extreme the predictions from previous trees (i.e., linear predictor far from zero; predicted probability close to either 0 or 1), the less influence the observation will receive in the current iteration. Thus, the gradient is weighted by the uncertainty of predictions from previous trees.
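To make this concrete, here is a minimal Python sketch (my own illustration, not XGBoost's actual implementation; it uses scikit-learn's DecisionTreeRegressor as the base learner and omits regularization and shrinkage) of one gradient-boosting step versus one Newton-boosting step for the log loss:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data and the raw (log-odds) scores F of the ensemble built so far.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(float)
F = np.zeros(300)                     # raw scores from previous iterations
p = 1.0 / (1.0 + np.exp(-F))          # current predicted probabilities

g = p - y                             # first derivative of the log loss w.r.t. F
h = p * (1.0 - p)                     # second derivative (Hessian)

# Gradient boosting: ordinary least-squares fit to the negative gradient.
gbm_tree = DecisionTreeRegressor(max_depth=3).fit(X, -g)

# Newton boosting (XGBoost-style, regularization omitted): weighted least
# squares with working response -g/h and weights h.
newton_tree = DecisionTreeRegressor(max_depth=3).fit(X, -g / h, sample_weight=h)
```

In XGBoost the leaf values are then set to $-\sum_i g_i / (\sum_i h_i + \lambda)$ per leaf, which is where the regularization term $\lambda$ mentioned in the question enters.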

Sigrist (2021) provides a good discussion of the differences, both in terms of predictive performance and the functions being optimized.

Sigrist, F. (2021). Gradient and Newton boosting for classification and regression. Expert Systems With Applications, 167, 114080. https://arxiv.org/abs/1808.03064

Marjolein Fokkema
-2

XGBoost and GBM work on the same principle. In XGBoost, parallel computation is possible: the split search within each tree is parallelized (the boosting iterations themselves remain sequential). XGBoost also offers more tuning parameters. Either can be used; I choose to go with XGBoost because the few extra tuning parameters give slightly more accuracy.