My guess is that CatBoost doesn't use the dummified variables, so the weight given to each (categorical) variable is more balanced compared to the other implementations, and the high-cardinality variables don't carry more weight than the others.
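As a toy illustration of that guess (mine, not any library's internal logic; the column names and the naive mean-target encoding are invented for the example): one-hot encoding explodes a high-cardinality column into many dummy columns among which the "importance" gets split, whereas a single encoded column keeps it as one feature.

    # Toy sketch, not how any of the libraries actually work internally.
    import pandas as pd

    df = pd.DataFrame({
        "zip_code": ["10115", "75001", "10115", "20095", "75001", "50667"],  # high cardinality
        "clicked":  [1, 0, 1, 0, 0, 1],
    })

    # One-hot: one dummy column per distinct zip code.
    dummies = pd.get_dummies(df["zip_code"], prefix="zip")
    print(dummies.shape[1], "dummy columns for a single categorical feature")

    # A single encoded column instead (naive mean-target encoding, shown only for
    # illustration; CatBoost uses ordered statistics to avoid target leakage).
    encoded = df["zip_code"].map(df.groupby("zip_code")["clicked"].mean())
    print(encoded.values)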
You want to look at this English-language paper from the Yandex team about CatBoost's mathematical uniqueness:
https://arxiv.org/abs/1706.09516
I read it briefly, and among the few things I could understand quickly was the fact that they do not use the residuals obtained on TRAIN to do TRAIN, since those residuals create an optimistic bias in the estimated learning quality. (Update: this novelty provides a way to fight overfitting, which is one of the reasons the algorithm works better than its analogues, apart from the variety of ways it offers to preprocess categorical variables.)
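A deliberately simplified sketch of that idea, as I read it (this is not CatBoost's actual algorithm, which maintains supporting models far more efficiently): the residual for an example is produced by a model that never saw that example, using a random permutation of the training data.

    # Toy version of the "ordered" idea: the residual for example i comes from a
    # model fitted only on the examples preceding i in a random permutation.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(42)
    X = rng.normal(size=(300, 3))
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=300)

    perm = rng.permutation(len(X))
    residual = np.empty(len(X))

    for pos, i in enumerate(perm):
        if pos < 30:
            residual[i] = y[i]      # too little "history" yet, fall back to the raw target
            continue
        prefix = perm[:pos]         # examples that come before i in the permutation
        m = DecisionTreeRegressor(max_depth=3).fit(X[prefix], y[prefix])
        residual[i] = y[i] - m.predict(X[i:i + 1])[0]   # m has never seen example i

    # 'residual' is now an (approximately) unbiased signal for fitting the next weak learner.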
I am sorry for not giving you a specific and full answer.
Mathematical differences between GBM, XGBoost
First, I suggest you read the paper by Friedman about the Gradient Boosting Machine applied to linear regression models, classifiers, and decision trees in particular: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
I will not go into the details here; it is just a good read covering various types of loss (L) as well as the variable-importance concept. Of course, this is a milestone paper: it implements the method as a descent in the space of functions (low-level models), rather than in the space of parameters, in pursuit of loss minimization.
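To make the "descent in the space of functions" point concrete, here is a minimal sketch of the core GBM loop for the squared loss (my toy code, not Friedman's): the negative gradient of the loss is just the residual, and each step adds a small tree fitted to it.

    # Minimal gradient boosting loop for L = 0.5 * (y - F)^2.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(500, 2))
    y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + 0.1 * rng.normal(size=500)

    learning_rate, n_rounds = 0.1, 100
    F = np.zeros(len(y))                      # current ensemble prediction F_m(x)
    trees = []

    for _ in range(n_rounds):
        neg_grad = y - F                      # -dL/dF, i.e. the residual for squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, neg_grad)
        F += learning_rate * tree.predict(X)  # step in function space
        trees.append(tree)

    print("train MSE:", np.mean((y - F) ** 2))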
If you look here: https://arxiv.org/pdf/1603.02754.pdf
you will find a mathematical vignette for the XGBoost model by Tianqi Chen et al. Now it becomes interesting. A couple of mathematical deviations of this model from Friedman's classic GBM are:
- Regularized (penalized) parameters (and we remember that the parameters in boosting are the functions, i.e. trees or linear models): L1 and L2 penalties are available.

- Using second derivatives (a second-order expansion of the loss) to speed up the process (if this was used before, please correct me); see the objective written out below.
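For reference, the objective minimized at iteration $t$ in the XGBoost paper combines a second-order Taylor expansion of the loss with an explicit complexity penalty ($g_i$ and $h_i$ are the first and second derivatives of the loss at the current prediction, $T$ the number of leaves, $w$ the leaf weights):

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2,$$

which gives the closed-form optimal weight of leaf $j$:

$$w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.$$

(The implementation additionally exposes an L1 penalty, alpha, on the leaf weights.)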
On this point: look here to find the implementation of the quantile loss in CatBoost, which comes in handy and provides both first and second derivatives: https://github.com/catboost/catboost/blob/master/catboost/libs/algo/error_functions.h
    class TQuantileError : public IDerCalcer<TQuantileError, /*StoreExpApproxParam*/ false> {
    public:
        const double QUANTILE_DER2 = 0.0;  // the second derivative is simply a constant zero

        double Alpha;
        SAVELOAD(Alpha);

        explicit TQuantileError(bool storeExpApprox)
            : Alpha(0.5)
        {
            CB_ENSURE(storeExpApprox == StoreExpApprox, "Approx format does not match");
        }

        TQuantileError(double alpha, bool storeExpApprox)
            : Alpha(alpha)
        {
            Y_ASSERT(Alpha > -1e-6 && Alpha < 1.0 + 1e-6);
            CB_ENSURE(storeExpApprox == StoreExpApprox, "Approx format does not match");
        }

        // first derivative of the quantile (pinball) loss:
        // Alpha when the target is above the current approx, -(1 - Alpha) otherwise
        double CalcDer(double approx, float target) const {
            return (target - approx > 0) ? Alpha : -(1 - Alpha);
        }

        double CalcDer2(double = 0, float = 0) const {
            return QUANTILE_DER2;
        }
    };
While you cannot find this useful L1 loss function among XGBoost's built-in objectives, you can try to compare Yandex's implementation with some of the custom loss functions people have written for XGBoost.
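For instance, here is a rough sketch (not official XGBoost or CatBoost code) of a pinball/quantile objective passed to XGBoost's training API as a custom objective; since the exact second derivative is zero almost everywhere, a common trick, assumed here, is to return a constant surrogate hessian:

    # Custom quantile (pinball) objective for XGBoost -- a sketch, not library code.
    import numpy as np
    import xgboost as xgb

    ALPHA = 0.5  # the quantile to fit

    def quantile_obj(preds, dtrain):
        y = dtrain.get_label()
        diff = y - preds
        grad = np.where(diff > 0, -ALPHA, 1.0 - ALPHA)  # d(pinball)/d(pred)
        hess = np.ones_like(diff)                       # surrogate constant hessian
        return grad, hess

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0] + rng.normal(size=1000)
    dtrain = xgb.DMatrix(X, label=y)

    booster = xgb.train({"max_depth": 3, "eta": 0.1}, dtrain,
                        num_boost_round=100, obj=quantile_obj)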
- Besides, CatBoost works excellently with categorical features, while XGBoost only accepts numeric inputs.
Consider this link: https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/#algorithm-main-stages_cat-to-numberic
They offer a variety of ways to feed categorical features into model training, on top of the old and well-known one-hot approach. Decreasing the dimensionality of the input space without losing much information is one of the possible reasons the fitted models are less overfitted.
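As a small illustration (the column names and data below are invented), raw categorical columns can be handed to the CatBoost Python package as-is, with no one-hot encoding on the user's side:

    # Feeding raw categorical columns to CatBoost.
    import pandas as pd
    from catboost import CatBoostClassifier, Pool

    df = pd.DataFrame({
        "city":   ["london", "paris", "paris", "berlin", "london", "berlin"],
        "device": ["ios", "android", "ios", "ios", "android", "android"],
        "age":    [23, 35, 41, 29, 52, 37],
        "bought": [0, 1, 1, 0, 1, 0],
    })

    train_pool = Pool(df[["city", "device", "age"]], label=df["bought"],
                      cat_features=["city", "device"])   # strings passed as-is

    model = CatBoostClassifier(iterations=50, depth=3, verbose=False)
    model.fit(train_pool)   # categories are encoded internally during training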
I am done. I don't use LightGBM, so I cannot shed any light on it.