Isn't stacking models a direct approach to overfitting?

Question

With the help of the discussions here, I successfully trained various models for classification.

As an example, say I trained a stochastic gradient boosted model (gbm) and an extreme gradient boosted tree model (xgboost). They are trained using cross-validation on a training set and then tested on a test set, measuring AUC (I get values around 0.87).

Now I would like to combine those models to get an even better one.

I tried to average the predicted probabilities and yes, AUC slightly improved on the test set.

But if I stack the models in the following sense:

Calculate the predicted probabilities $p_{\text{gbm}}$ and $p_{\text{xgb}}$ on the training set and use these as predictors.
Train some model (linear, tree) of the form $\text{class} \sim p_{\text{gbm}}+p_{\text{xgb}}$.

Models of this kind have an AUC of 0.9 on the training set and 0.8 on the test set (lower than the individual models).

Isn't using something more sophisticated than an average or a linear weighting just overfitting the training set? The amount of information about the data does not increase; it is just hidden in the stage-one predictions.

I would appreciate any comment!

Consider this - model 1 predicts $0.1$ chance, model 2 predicts $0.95$ chance. Hard to combine without getting a worse prediction than either. Need careful choice of weights. — probabilityislogic, Jun 08 '16 at 09:22
Yes, I read (and it is common sense if one thinks it through) that ensembling (stacking, averaging) works well only for models with similar precision. Furthermore, they should be uncorrelated in order to improve each other. In my case the models have similar precision, but they are highly correlated. — Richi W, Jun 08 '16 at 11:40

score 3 · Accepted Answer · answered May 16 '17 at 19:46

The problem is simple: the predictions of your base classifiers (gbm and xgb) are biased. If the classifiers are trained and evaluated on the same data, they usually perform better than on test data (although this depends on your sample size and more). What is the stacker left to learn? Biased predictions.

So you may use cross-validation for the base classifiers: train them on part of the training set, predict on the held-out part they have not seen, and use those unbiased predictions to train the stacker. This way the stacker does not overfit to the optimistic in-sample predictions.
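A minimal sketch of this idea, assuming Python with scikit-learn (not something the question specifies). The synthetic data and the two `GradientBoostingClassifier` configurations are stand-ins for the asker's data, gbm and xgboost; `cross_val_predict` produces out-of-fold predictions, so every training row is scored by a model that never saw it. The naive variant from the question is fitted alongside for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic data: an assumption standing in for the real data set.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Two boosted models as stand-ins for gbm and xgboost (hypothetical settings).
base_models = [
    GradientBoostingClassifier(n_estimators=50, subsample=0.8, random_state=0),
    GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=1),
]

# Out-of-fold probabilities on the training set: each row is predicted by a
# clone of the model that was fitted on the other folds only.
oof_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit the base models on the full training set to score the test set.
for m in base_models:
    m.fit(X_train, y_train)
in_sample_train = np.column_stack(
    [m.predict_proba(X_train)[:, 1] for m in base_models])
test_features = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in base_models])

# Naive stacker (as in the question) versus stacker trained on out-of-fold data.
naive_stacker = LogisticRegression().fit(in_sample_train, y_train)
oof_stacker = LogisticRegression().fit(oof_train, y_train)

auc_naive = roc_auc_score(y_test, naive_stacker.predict_proba(test_features)[:, 1])
auc_oof = roc_auc_score(y_test, oof_stacker.predict_proba(test_features)[:, 1])

# The in-sample AUC of a base model is optimistic compared with its
# out-of-fold AUC on the same rows; that gap is the bias the stacker inherits.
auc_train_in_sample = roc_auc_score(y_train, in_sample_train[:, 0])
auc_train_oof = roc_auc_score(y_train, oof_train[:, 0])

print("base model, training AUC in-sample  :", round(auc_train_in_sample, 3))
print("base model, training AUC out-of-fold:", round(auc_train_oof, 3))
print("test AUC, naive stacker      :", round(auc_naive, 3))
print("test AUC, out-of-fold stacker:", round(auc_oof, 3))
```

On real data the size of the gap, and whether the out-of-fold stacker beats the individual models, depends on the sample size and on how correlated the base models are.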

Two details: if you want a very nice ready-made implementation of k-folding, you may consider using the REP package (if you use Python).

Stacking two boosted decision trees is unlikely to give you a real improvement, because the two models are highly correlated. Better to use XGBoost alone; it's a far superior classifier. Or stack it with an SVM, a NN or similar.
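As a rough illustration of stacking a boosted tree model with a different kind of learner, here is a sketch that assumes scikit-learn's `StackingClassifier`, which trains the final estimator on cross-validated predictions of the base estimators (controlled by its `cv` parameter). The synthetic data, the choice of `GradientBoostingClassifier` as a stand-in for XGBoost, and all hyperparameters are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data: an assumption standing in for the real data set.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

stack = StackingClassifier(
    estimators=[
        # Boosted trees (stand-in for XGBoost).
        ("gbm", GradientBoostingClassifier(n_estimators=50, random_state=0)),
        # An SVM on standardized features as a structurally different learner.
        ("svm", make_pipeline(StandardScaler(), SVC())),
    ],
    # A simple linear stacker keeps the second stage hard to overfit.
    final_estimator=LogisticRegression(),
    # The stacker is fitted on 5-fold cross-validated base predictions.
    cv=5,
)
stack.fit(X_train, y_train)

stacked_auc = roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1])
print("test AUC of the stacked model:", round(stacked_auc, 3))
```

Whether this beats the boosted model alone has to be checked on a held-out test set; the sketch only shows the mechanics.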
