I'm working on a classification task with data from a single company spanning 2017 to 2020. I train several models (Random Forest, XGBoost, LightGBM, CatBoost, Explainable Boosting Machines) on one year at a time (2017, 2018, or 2019) and evaluate each of them on 2020. I see a curious behavior and would like to understand whether it is something known in the literature or just an artifact of this particular dataset.
In particular, when training on 2019 data, all the boosting algorithms outperform Random Forest on 2020 (0.78-0.79 AUC vs 0.76). This changes dramatically when I train on 2017 or 2018 data and evaluate on 2020. That data is further out of distribution: there is certainly label shift, the data itself is quite different, and the learned models' feature importances/PDPs differ considerably between years. Yet Random Forest still generalizes decently (AUC on 2020 of 0.704 when trained on 2017 and 0.706 when trained on 2018), while the boosting algorithms do worse on average, with a large gap for LightGBM between the two training years (trained on 2017: XGBoost 0.567, LightGBM 0.565, CatBoost 0.639, EBM 0.521; trained on 2018: XGBoost 0.661, LightGBM 0.734 (??), CatBoost 0.639, EBM 0.685). For reference, the evaluation protocol is roughly the sketch below.
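(The file name, column names, and default hyperparameters here are placeholders, not my actual setup; it's only meant to make the protocol concrete.)

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from interpret.glassbox import ExplainableBoostingClassifier

# Placeholder data: one row per observation, with a "year" column and a binary "target".
df = pd.read_csv("company_data.csv")
features = [c for c in df.columns if c not in ("year", "target")]

test = df[df["year"] == 2020]

models = {
    "RandomForest": RandomForestClassifier(n_estimators=500, random_state=0),
    "XGBoost": XGBClassifier(random_state=0),
    "LightGBM": LGBMClassifier(random_state=0),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=0),
    "EBM": ExplainableBoostingClassifier(random_state=0),
}

# Train on a single year, always evaluate on 2020.
for train_year in (2017, 2018, 2019):
    train = df[df["year"] == train_year]
    for name, model in models.items():
        model.fit(train[features], train["target"])
        scores = model.predict_proba(test[features])[:, 1]
        auc = roc_auc_score(test["target"], scores)
        print(f"trained on {train_year} | {name}: AUC on 2020 = {auc:.3f}")
```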
Granted, I have not performed extensive hyperparameter tuning or further testing, and this might be a very particular case that depends on the data and hyperparameters. Still, I was wondering:
Is there any literature (I could not find any) on the out-of-distribution robustness of Random Forests vs. boosting algorithms that might explain this behavior?
Intuitively, it makes some sense that the variance reduction obtained by bagging would help even out of distribution, since some of the individual learners might still have learned something relevant, but I am not sure that argument is enough.
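The intuition I have in mind is the standard variance-reduction argument for bagging (e.g. ESL, ch. 15): for $B$ identically distributed base learners with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is

$$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} \hat f_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2,$$

so the averaged prediction stays comparatively stable even when individual trees are noisy. What I don't know is whether this argument carries over to data under label/covariate shift.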
PS: As a sanity check I also tried logistic regression and Gaussian Naive Bayes, which show the same consistent drop in performance (from ~0.7 down to 0.45-0.6).