I'm trying model stacking in a Kaggle competition (what the competition itself is about is irrelevant here). I suspect my approach to stacking is incorrect.
I have 4 different models:
an XGBoost model on dense features (numeric features that have a natural ordering),
an AdaBoost model on sparse features (non-numeric features, label encoded and then one-hot encoded),
an XGBoost model on dense features (sentiment scores from NLTK's VADER run on text).
Each of these models outputs class probabilities for the multi-class problem. Those probabilities feed into a final neural network, which combines them and produces another set of class probabilities.
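To make the setup concrete, here is a minimal sketch of this kind of stacking in scikit-learn. The data, model choices, and hyperparameters are placeholders (GradientBoostingClassifier stands in for XGBoost); the key detail is that the meta-model is trained on out-of-fold base-model probabilities rather than in-sample predictions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the competition data (3-class problem).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    GradientBoostingClassifier(random_state=0),  # stand-in for an XGBoost model
    AdaBoostClassifier(random_state=0),
]

# Out-of-fold probabilities for the meta-model's training set: each base model
# predicts on folds it was not trained on, so the neural network never sees
# in-sample (leaky) predictions from the base models.
train_meta = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])

# Refit each base model on the full training set to build test-time features.
test_meta = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])

# Neural-network meta-model combining the base-model probabilities.
meta = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
meta.fit(train_meta, y_train)
print("stacked accuracy:", accuracy_score(y_test, meta.predict(test_meta)))
```

If the meta-model is instead trained on predictions the base models made on their own training data, the stack tends to overfit, which is one common reason a stacked model scores worse than its best single model.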
However, every additional model I stack in makes the result worse. For example, with only the first model I get 73% accuracy, but with each model added it drops below 70%, and my Kaggle score worsens from 0.6x to above 1.0.
Is this approach incorrect?