Based on Kaggle winners' data, ensemble boosting methods such as XGBoost, LightGBM, and CatBoost appear to be the top choices for maximizing prediction accuracy on structured or tabular data. However, in industry, as far as I know, the trend is moving towards neural network models for the same problems. What are some reasons to prefer deep learning/neural network models over boosting methods for tabular data?
- I don't think industry prefers deep learning over GBTs for tabular data as of 2021, unless you count linear models as neural nets... Is there some empirical study in this direction? It would be interesting. – Michael M Dec 11 '21 at 09:19
- I agree with @MichaelM on this (+1). I have seen some places that already had a reasonable investment in Computer Vision and/or NLP try to leverage the same infrastructure for tabular data just because the infra (and people) are already there, so they are by no means starting from scratch. Also, both TensorFlow and PyTorch (and cuDNN) have *pretty decent* marketing mechanisms around them... ;) – usεr11852 Dec 16 '21 at 00:57
1 Answer
There are no reasons to use technique X over technique Y other than performance gains and/or ease of use. Just to clarify: by "performance gains" here I mean raw metrics as well as overall metric behaviour, such as stability, out-of-sample generalisation, and the technical requirements during training to get adequate performance; similarly, "ease of use" is not confined to installation but extends to training requirements (the human ones), ease of auditing, integration into pipelines, etc.
Right now, standard deep learning approaches are not outperforming standard GBM approaches on tabular data; Kaggle competitions, as you point out, strongly suggest that. In addition, DL approaches have a higher technical threshold both to start getting used (trained and deployed) and to be investigated (model explainability), so there are no immediate reasons to use DL over GBMs for tabular data. A very interesting recent reference on the matter is "Tabular Data: Deep Learning is Not All You Need" (2021) by Shwartz-Ziv & Armon; their study suggests that some of the current deep models are generally outperformed by XGBoost, while XGBoost also requires less tuning.
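To make the "less tuning" contrast concrete, here is a minimal sketch of the kind of comparison people run. It is not a benchmark: the synthetic dataset, model sizes, and hyperparameters are arbitrary assumptions on my part, and it assumes xgboost and scikit-learn are installed.

```python
# Illustrative only: compare an out-of-the-box GBM with a small MLP
# on synthetic tabular data. Dataset and hyperparameters are arbitrary.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=30,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Gradient-boosted trees: near-default settings are often competitive.
gbm = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6)
gbm.fit(X_tr, y_tr)

# A small MLP: typically needs feature scaling, architecture and
# learning-rate tuning before it catches up.
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                    random_state=0)
mlp.fit(X_tr, y_tr)

print("GBM accuracy:", accuracy_score(y_te, gbm.predict(X_te)))
print("MLP accuracy:", accuracy_score(y_te, mlp.predict(X_te)))
```

In my experience, on toy data like this the GBM with near-default settings is already competitive, while the neural net usually needs preprocessing and hyperparameter search before the gap closes; that is the tuning-cost asymmetry Shwartz-Ziv & Armon describe.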
This state doesn't have to be perpetuated forever; graph neural networks might allow us to encode additional information in our NNs that standard GBMs are unable to capture, and new NN architectures (e.g. NODE, TabNet, etc.) try to bridge the gap between GBMs and NNs. A very interesting recent survey, "Deep Neural Networks and Tabular Data: A Survey" (2021) by Borisov et al., looks at the whole issue quite holistically, covering both the current state and some open challenges. Finally, GBMs might just be "good enough" to survive for a very long time; for example, Holt-Winters' seasonal method (i.e. triple exponential smoothing) has been around since the late 1950s and is still pretty good for some problems! :)
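For completeness, here is a minimal sketch of that method via statsmodels; the toy monthly series and the seasonal period of 12 are illustrative assumptions.

```python
# A minimal Holt-Winters (triple exponential smoothing) sketch on a
# synthetic monthly series with trend and yearly seasonality.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
months = np.arange(120)
y = (10 + 0.05 * months                       # linear trend
     + 2 * np.sin(2 * np.pi * months / 12)    # yearly seasonality
     + rng.normal(scale=0.5, size=120))       # noise

# Additive trend and seasonality; smoothing parameters fit by MLE.
model = ExponentialSmoothing(y, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(12))  # forecast the next 12 months
```

A sixty-year-old method in a dozen lines, which is exactly the "good enough" longevity argument.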

- I don't agree with the first sentence yet. There are many other aspects, like interpretability or stability. – Michael M Dec 16 '21 at 07:07
- Fair point; sorry, I was being a bit generic here. For example, I count the "interpretability" point (which I also make later) as part of the overall "ease of use", and I would consider "stability" part of overall "performance". I will clarify a bit more. – usεr11852 Dec 16 '21 at 12:40