2

I went through the explanation here (http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/), but I don't understand why the new trees in gradient boosting try to predict the gradient of the loss function instead of the actual residual $(y - \hat{y})$. Also, is my understanding right that the first tree uses (as its prediction) a simple constant value to approximate the true values, like the median of all the $y$?

Another question: how does it work for classification?

Can anyone please explain the statement below, in detail if possible, from Understanding gradient boosting?

"In general, gradient boosting, when used for classification, fits trees not on the level of the gradient of predicted probabilities, but to the gradient of the predicted log-odds."

tjt
  • you can use mathjax for math typesetting. more information: https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference – Sycorax Aug 07 '19 at 17:05
  • @Matthew Drury can you please answer this – tjt Aug 07 '19 at 17:11

1 Answer

1

For (half) squared error loss, the negative gradient is exactly the residual. In general, a tree is fit to the negative gradient, rather than the gradient being added directly, in order to avoid overfitting: the training loss $L(y_i,\hat{y}_i)$ can be made arbitrarily small if we keep adding to $\hat{y}_i$ the (scaled) negative gradient of $L(\cdot)$ with respect to $\hat{y}_i$. The additive expansion of trees (boosting) stops when the test error has settled.
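To make the connection concrete, here is a minimal sketch of that loop for half squared error, where the negative gradient $-\partial L/\partial\hat{y} = y - \hat{y}$ is the residual. It is not from the linked post; it assumes scikit-learn's DecisionTreeRegressor, a toy sinusoidal dataset, and an arbitrary learning rate of 0.1. The model starts from a constant prediction and each new tree is fit to the negative gradient.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, just for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# Start from a constant prediction (the mean here; other losses lead to
# other constants, e.g. the median for absolute error).
pred = np.full_like(y, y.mean())
learning_rate = 0.1  # assumed shrinkage factor

for _ in range(100):
    # For L = 0.5 * (y - pred)**2 the negative gradient w.r.t. pred
    # is exactly the residual y - pred.
    negative_gradient = y - pred
    # Fit a small tree to the negative gradient (not to y itself) and
    # take a damped step in that direction.
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, negative_gradient)
    pred += learning_rate * tree.predict(X)

print("training MSE:", np.mean((y - pred) ** 2))
```

For other losses only the `negative_gradient` line changes; fitting the tree to that per-sample gradient, instead of adding it raw, is what lets the update generalize to unseen $x$.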

PaulG