Are XGBoost probability outputs based on the number of examples in a terminal leaf?

Question

I am trying to replace a C4.5 tree that someone else implemented with a boosted tree (XGBoost). The data is extremely skewed, and the company wants the new model to output similar distributions.

C4.5 trees determine probabilities based on the number of observations that end up in a terminal leaf, and I was wondering whether that is also the case with XGBoost.

Sycorax · Accepted Answer · 2020-10-22T19:44:25.527

4

Are XGBoost probability outputs based on the number of examples in a terminal leaf?

No. XGBoost is a gradient-boosted tree ensemble, so it estimates a weight vector $c \in \mathbb{R}^M$ that assigns one weight to each of the $M$ leaves. A sample's prediction (on the logit scale) is the sum of the weights of the leaves it falls into, one leaf per tree. In the binary case, applying the logistic function (the inverse of the logit) to that score yields a predicted probability.
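As a rough illustration of that summation, here is a minimal pure-Python sketch. The two trees, their split features (`age`, `income`), the leaf weights, and the zero base margin are all made up for the example; none of them come from a real XGBoost model or its API.

```python
import math


def sigmoid(z):
    """Logistic function: maps a logit score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical trees: each one routes a sample to exactly one leaf
# and returns that leaf's weight (a real number on the logit scale).
def tree_1(x):
    return 0.4 if x["age"] < 30 else -0.7


def tree_2(x):
    return 0.25 if x["income"] >= 50_000 else -0.1


TREES = [tree_1, tree_2]
BASE_MARGIN = 0.0  # assumed starting logit (corresponds to probability 0.5)


def predict_margin(x):
    """Sum of the weights of the leaves the sample lands in, plus the base margin."""
    return BASE_MARGIN + sum(tree(x) for tree in TREES)


def predict_proba(x):
    """Binary case: logistic function of the summed logit score."""
    return sigmoid(predict_margin(x))


sample = {"age": 25, "income": 60_000}
print(predict_margin(sample))  # 0.4 + 0.25 = 0.65 on the logit scale
print(predict_proba(sample))   # roughly 0.657
```

Nothing in this calculation counts how many training examples reached a leaf; the counts only influence the result indirectly, through how the leaf weights were fitted.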

The XGBoost paper has a helpful description of how it works.

Tianqi Chen, Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System."

See also: In XGboost are weights estimated for each sample and then averaged

edited Oct 22 '20 at 19:44

answered Jun 30 '17 at 16:59

Sycorax


So using the skewed training set as-is would not give me probabilities that are closer to reality than the probabilities I would get after training XGBoost with an under-sampled training set? – alwayslearning Jun 30 '17 at 17:41
That component of your question has been answered several times. See the threads in this search: https://stats.stackexchange.com/search?q=xgboost+imbalanced – Sycorax Jun 30 '17 at 17:42
I know how to tweak XGBoost to handle imbalanced datasets. In my comment I was basically asking about calibrating the output of XGBoost by not under-sampling, but I think I should ask that in a separate post. After running some experiments, I did see a drop in the average probability output when I don't under-sample. Thanks for your help! – alwayslearning Jun 30 '17 at 22:16

score 1 · Answer 2 · answered Jun 30 '17 at 10:18

No. I do not know how XGBoost estimates probabilities, but in my experience, if you have, say, 100 samples, the probability estimates will not (necessarily) be multiples of 0.01 (which would be the case for a decision tree). Therefore, there must be something else involved in how XGBoost estimates probabilities.
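To make the contrast concrete, here is a small sketch comparing a count-based leaf estimate with a boosted-tree leaf weight. It assumes the optimal leaf weight given in the XGBoost paper, $w^* = -G / (H + \lambda)$, with logistic-loss gradients $g_i = p_i - y_i$ and Hessians $h_i = p_i(1 - p_i)$. It also assumes a single boosting round starting from $p_i = 0.5$, $\lambda = 1$, and no learning-rate shrinkage. The leaf contents (3 positives out of 100) and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps a logit score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def leaf_frequency(labels):
    """Count-based estimate: fraction of positive examples in the leaf."""
    return sum(labels) / len(labels)


def boosted_leaf_weight(labels, probs, reg_lambda=1.0):
    """Leaf weight -G / (H + lambda) for logistic loss (shrinkage omitted)."""
    grad_sum = sum(p - y for p, y in zip(probs, labels))  # G
    hess_sum = sum(p * (1.0 - p) for p in probs)          # H
    return -grad_sum / (hess_sum + reg_lambda)


# Hypothetical leaf: 100 training examples, 3 of them positive.
labels = [1] * 3 + [0] * 97
# Assumed current predictions before this round: 0.5 for every example.
probs = [0.5] * len(labels)

freq = leaf_frequency(labels)                # count-based leaf probability
weight = boosted_leaf_weight(labels, probs)  # logit-scale leaf weight
prob_after_one_round = sigmoid(weight)       # base margin assumed to be 0

print(freq, weight, prob_after_one_round)
```

Under these assumptions the count-based estimate is exactly 3/100, while the boosted estimate is the logistic function of a real-valued weight and is not tied to multiples of 0.01.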
