My guess is that CatBoost doesn't use the dummified variables, so the weight given to each (categorical) variable is more balanced compared to the other implementations, and the high-cardinality variables don't carry more weight than the others.
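As a toy illustration of that guess (mine, not any library's internal logic; the column names and the naive mean-target encoding are invented for the example): one-hot encoding explodes a high-cardinality column into many dummy columns among which the "importance" gets split, whereas a single encoded column keeps it as one feature.

    # Toy sketch, not how any of the libraries actually work internally.
    import pandas as pd

    df = pd.DataFrame({
        "zip_code": ["10115", "75001", "10115", "20095", "75001", "50667"],  # high cardinality
        "clicked":  [1, 0, 1, 0, 0, 1],
    })

    # One-hot: one dummy column per distinct zip code.
    dummies = pd.get_dummies(df["zip_code"], prefix="zip")
    print(dummies.shape[1], "dummy columns for a single categorical feature")

    # A single encoded column instead (naive mean-target encoding, shown only for
    # illustration; CatBoost uses ordered statistics to avoid target leakage).
    encoded = df["zip_code"].map(df.groupby("zip_code")["clicked"].mean())
    print(encoded.values)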
You want to look at this English-language paper from the Yandex team about CatBoost's mathematical uniqueness:
https://arxiv.org/abs/1706.09516
I read it briefly, and among the few things I could understand quickly was the fact that they do not use the residuals obtained on TRAIN to do TRAIN, since those residuals create an optimistic bias in the estimated learning quality. (Update: this novelty provides a way to fight overfitting, which is one of the reasons the algorithm works better than its analogues, apart from the variety of ways it offers to preprocess categorical variables.)
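A deliberately simplified sketch of that idea, as I read it (this is not CatBoost's actual algorithm, which maintains supporting models far more efficiently): the residual for an example is produced by a model that never saw that example, using a random permutation of the training data.

    # Toy version of the "ordered" idea: the residual for example i comes from a
    # model fitted only on the examples preceding i in a random permutation.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 3))
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)

    perm = rng.permutation(len(X))
    residual = np.empty(len(X))

    for pos, i in enumerate(perm):
        if pos < 30:
            residual[i] = y[i]      # too little "history" yet, fall back to the raw target
            continue
        prefix = perm[:pos]         # examples that come before i in the permutation
        m = DecisionTreeRegressor(max_depth=3).fit(X[prefix], y[prefix])
        residual[i] = y[i] - m.predict(X[i:i + 1])[0]   # m has never seen example i

    # 'residual' is now an (approximately) unbiased signal for fitting the next weak learner.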
I am sorry for not giving you a specific and full answer.
Mathematical differences between GBM, XGBoost
First, I suggest you read the paper by Friedman about the Gradient Boosting Machine applied to linear regression models, classifiers, and decision trees in particular: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
I will not go into the details here; it is just a good read covering various types of loss (L) as well as the variable-importance concept. Of course, this is a milestone paper: it implements the method as a descent in the space of functions (low-level models), rather than in the space of parameters, in pursuit of loss minimization.
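To make the "descent in the space of functions" point concrete, here is a minimal sketch of the core GBM loop for the squared loss (my toy code, not Friedman's): the negative gradient of the loss is just the residual, and each step adds a small tree fitted to it.

    # Minimal gradient boosting loop for L = 0.5 * (y - F)^2.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + 0.1 * rng.normal(size=500)

    learning_rate, n_rounds = 0.1, 100
    F = np.zeros(len(y))                      # current ensemble prediction F_m(x)
    trees = []

    for _ in range(n_rounds):
        neg_grad = y - F                      # -dL/dF, i.e. the residual for squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
        F += learning_rate * tree.predict(X)  # step in function space
        trees.append(tree)

    print("train MSE:", np.mean((y - F) ** 2))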
If you look here: https://arxiv.org/pdf/1603.02754.pdf
you will find a mathematical vignette for the XGBoost model by Tianqi Chen et al. Now it becomes interesting. A couple of mathematical deviations of this model from Friedman's classic GBM are:
- Regularized (penalized) parameters (and we remember that the parameters in boosting are the functions, i.e. trees or linear models): L1 and L2 penalties are available.

- Using second derivatives (a second-order expansion of the loss) to speed up the process (if this was used before, please correct me); see the objective written out below.
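For reference, the objective minimized at iteration $t$ in the XGBoost paper combines a second-order Taylor expansion of the loss with an explicit complexity penalty ($g_i$ and $h_i$ are the first and second derivatives of the loss at the current prediction, $T$ the number of leaves, $w$ the leaf weights):

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

which gives the closed-form optimal weight of leaf $j$:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$

(The implementation additionally exposes an L1 penalty, alpha, on the leaf weights.)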
On this point: look here to find the implementation of the quantile loss in CatBoost, which comes in handy and provides both first and second derivatives: https://github.com/catboost/catboost/blob/master/catboost/libs/algo/error_functions.h
    class TQuantileError : public IDerCalcer<TQuantileError, /*StoreExpApproxParam*/ false> {
    public:
        const double QUANTILE_DER2 = 0.0;  // the second derivative is simply a constant zero

        double Alpha;
        SAVELOAD(Alpha);

        explicit TQuantileError(bool storeExpApprox)
            : Alpha(0.5)
        {
            CB_ENSURE(storeExpApprox == StoreExpApprox, "Approx format does not match");
        }

        TQuantileError(double alpha, bool storeExpApprox)
            : Alpha(alpha)
        {
            Y_ASSERT(Alpha > -1e-6 && Alpha < 1.0 + 1e-6);
            CB_ENSURE(storeExpApprox == StoreExpApprox, "Approx format does not match");
        }

        // first derivative of the quantile (pinball) loss:
        // Alpha when the target is above the current approx, -(1 - Alpha) otherwise
        double CalcDer(double approx, float target) const {
            return (target - approx > 0) ? Alpha : -(1 - Alpha);
        }

        double CalcDer2(double = 0, float = 0) const {
            return QUANTILE_DER2;
        }
    };
While you cannot find this useful L1 loss function among XGBoost's built-in objectives, you can try to compare Yandex's implementation with some of the custom loss functions people have written for XGBoost.
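For instance, here is a rough sketch (not official XGBoost or CatBoost code) of a pinball/quantile objective passed to XGBoost's training API as a custom objective; since the exact second derivative is zero almost everywhere, a common trick, assumed here, is to return a constant surrogate hessian:

    # Custom quantile (pinball) objective for XGBoost -- a sketch, not library code.
    import numpy as np
    import xgboost as xgb

    ALPHA = 0.5  # the quantile to fit

    def quantile_obj(preds, dtrain):
        y = dtrain.get_label()
        diff = y - preds
        grad = np.where(diff > 0, -ALPHA, 1.0 - ALPHA)  # d(pinball)/d(pred)
        hess = np.ones_like(diff)                       # surrogate constant hessian
        return grad, hess

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0] + rng.normal(size=1000)
    dtrain = xgb.DMatrix(X, label=y)

    booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                        num_boost_round=100, obj=quantile_obj)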
- Besides, CatBoost works excellently with categorical features, while XGBoost only accepts numeric inputs.
Consider this link: https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/#algorithm-main-stages_cat-to-numberic
They offer a variety of ways to feed categorical features into model training, on top of the old and well-known one-hot approach. Decreasing the dimensionality of the input space without losing much information is one of the possible reasons the fitted models are less overfitted.
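As a small illustration (the column names and data below are invented), raw categorical columns can be handed to the CatBoost Python package as-is, with no one-hot encoding on the user's side:

    # Feeding raw categorical columns to CatBoost.
    import pandas as pd
    from catboost import CatBoostClassifier, Pool

    df = pd.DataFrame({
        "city":   ["london", "paris", "paris", "berlin", "london", "berlin"],
        "device": ["ios", "android", "ios", "ios", "android", "android"],
        "age":    [23, 35, 41, 29, 52, 37],
        "bought": [0, 1, 1, 0, 1, 0],
    })

    train_pool = Pool(df[["city", "device", "age"]], label=df["bought"],
                      cat_features=["city", "device"])   # strings passed as-is

    model = CatBoostClassifier(iterations=50, depth=3, verbose=False)
    model.fit(train_pool)   # categories are encoded internally during training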
I am done. I don't use LightGBM, so I cannot shed any light on it.