17

I have been trying to understand gradient boosting by reading various blogs and websites, and by looking through, for example, the XGBoost source code. However, I cannot find an understandable explanation of how gradient boosting algorithms produce probability estimates. So, how do they calculate the probabilities?

Icyeval
  • 383
  • 3
  • 7
  • 3
    This essentially asks and answers the same question, in case a different explanation would be useful to you: https://stats.stackexchange.com/questions/204154/classification-with-gradient-boosting-how-to-keep-the-prediction-in-0-1/205140#205140 – Matthew Drury Jul 08 '18 at 21:33

1 Answer

17

TL;DR: The log-odds for a sample is the sum of the weights of its terminal leaves. The probability of the sample belonging to class 1 is the inverse-logit transformation of that sum.


As in logistic regression, the logistic (inverse-logit) function turns a score that is linear on the logit scale into a probability:

$$ z = Xw \\ \mathbb{P}(y=1|X) = \frac{1}{1 + \exp(-z)} $$

Unlike logistic regression, the "features" in $X$ are constructed as the terminal nodes of an ensemble of decision trees grown by the boosting procedure. Each row of $X$ records the terminal leaves reached by one sample; the row is a $T$-hot binary vector, for $T$ the number of trees. (Each XGBoost tree is grown according to a particular algorithm, but that's not relevant here.)
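
To make the construction of a row concrete, here is a minimal NumPy sketch. The ensemble shape (3 trees with 8 leaves each) and the leaf indices are invented for illustration; they match the example worked out in the comments below.

```python
import numpy as np

# Hypothetical ensemble: T = 3 trees, each with 8 terminal leaves (24 columns in X).
n_leaves_per_tree = [8, 8, 8]
offsets = np.cumsum([0] + n_leaves_per_tree[:-1])  # column offset of each tree's first leaf

# Suppose a sample lands in leaf 1 of tree 0, leaf 2 of tree 1 and leaf 5 of tree 2
# (leaves indexed 1..8 within each tree).
leaf_hit = [1, 2, 5]

x_row = np.zeros(sum(n_leaves_per_tree))
for tree, leaf in enumerate(leaf_hit):
    x_row[offsets[tree] + (leaf - 1)] = 1.0  # T-hot: exactly one 1 per tree

print(np.nonzero(x_row)[0] + 1)  # 1-indexed columns: [ 1 10 21]
```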

There are $n$ columns in $X$, one column for each terminal node. There is no simple formula for $n$, because the number of leaves can vary from tree to tree (and usually does, in my experience).

Each leaf in each tree has an associated "weight." That weight is recorded in $w$. To be conformable with $X$, there are $n$ elements in $w$. The weights themselves are determined by the gradient boosting procedure; see: In XGboost are weights estimated for each sample and then averaged
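
Putting the pieces together, a minimal sketch of the probability computation; the leaf weights below are made up purely for illustration:

```python
import numpy as np

def inv_logit(z):
    """Inverse-logit (logistic) transform: log-odds -> probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights for the 24 terminal leaves of the hypothetical 3-tree ensemble above.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=24)   # one weight per terminal leaf (n = 24)

# T-hot row for a sample landing in (1-indexed) columns 1, 10 and 21.
x_row = np.zeros(24)
x_row[[0, 9, 20]] = 1.0

z = x_row @ w        # log-odds: sum of the weights of the leaves the sample falls into
p = inv_logit(z)     # P(y = 1 | x)
print(z, p)
```

If you want to check this against a fitted model, the native XGBoost API's `Booster.predict(..., output_margin=True)` returns the untransformed log-odds $z$; applying the inverse-logit to it reproduces the probability that `predict` reports (note that the margin also includes the model's base score).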

Sycorax
  • 76,417
  • 20
  • 189
  • 313
  • This is very helpful, thanks. How many elements would the beta-vector contain? Would it be equal to the number of _total_ leaf nodes across _all_ trees? (And there would be equal number of columns in the X matrix, correct?) – Vishal Jun 06 '18 at 20:00
  • Thank you for the updated answer. Does this mean that there's a _unique_ `X` matrix as well as a _unique_ set of betas _for each_ sample/observation (`i`)? In other words, for every sample/observation for which you want to calculate the probability of belonging to class 1, you'd need to determine the unique values of the `X` matrix and beta vector? – Vishal Jul 03 '18 at 15:50
  • 1
Each row of $X$ stores the terminal leaves for a sample. – Sycorax Jul 03 '18 at 15:54
  • @Sycorax when you say each row of X collects terminal leafs of each sample, can you give an example? – Maths12 Sep 02 '20 at 14:43
  • 1
    Suppose you have 3 trees and each tree has 8 leaves (24 in total), which we can index 1,2,3...8. A sample might end up in leaves [1,2,5] of each tree respectively, so a row of $X$ is all zeros except columns 1, 10 and 21, which contain a 1. – Sycorax Sep 02 '20 at 14:48
  • @Sycorax is the weight associated with a leaf = 'mean of values in the leaf' ? – tjt Oct 16 '20 at 18:11
  • @tjt No, gradient boosting is used to determine leaf weights. https://stats.stackexchange.com/questions/288150/are-xgboost-probability-outputs-based-on-the-number-of-examples-in-a-terminal-le/288215#288215 – Sycorax Oct 16 '20 at 18:14