How do people use stacking (meta-ensembling) with cross-validation in practice and in machine learning competitions such as those on Kaggle? Here are two approaches I've seen (but maybe neither is correct?)
Method 1 (probably introduces a leak)
splits: A B C
First Layer Models
- fit {KNN, SVM} on [A, B], predict on C -> C'
- fit {KNN, SVM} on [B, C], predict on A -> A'
- fit {KNN, SVM} on [C, A], predict on B -> B'
Meta Ensemble
- fit LogReg on [A', B'], predict on C'
- fit LogReg on [B', C'], predict on A'
- fit LogReg on [C', A'], predict on B'
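To make Method 1 concrete, here is roughly how I would code it (a sketch only; the binary-classification setup, the use of predicted probabilities as meta features, and the function name are my own assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def method1_meta_predictions(X, y, n_splits=3, seed=0):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = list(kf.split(X))

    # First layer: for each fold, fit base models on the other folds
    # and predict on the held-out fold (producing A', B', C').
    meta_features = np.zeros((len(X), 2))  # one column per base model
    for train_idx, test_idx in folds:
        knn = KNeighborsClassifier().fit(X[train_idx], y[train_idx])
        svm = SVC(probability=True).fit(X[train_idx], y[train_idx])
        meta_features[test_idx, 0] = knn.predict_proba(X[test_idx])[:, 1]
        meta_features[test_idx, 1] = svm.predict_proba(X[test_idx])[:, 1]

    # Meta ensemble: the same rotation over the derived features.
    # This is where the suspected leak lives: the rows used to fit LogReg
    # were produced by base models that saw the held-out fold's targets.
    meta_preds = np.zeros(len(X))
    for train_idx, test_idx in folds:
        logreg = LogisticRegression().fit(meta_features[train_idx], y[train_idx])
        meta_preds[test_idx] = logreg.predict_proba(meta_features[test_idx])[:, 1]
    return meta_preds
```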
Method 2
splits: A B C D
First Layer Models (Fold D)
- fit {KNN, SVM} on [A, B], predict on C -> C'
- fit {KNN, SVM} on [B, C], predict on A -> A'
- fit {KNN, SVM} on [C, A], predict on B -> B'
- fit {KNN, SVM} on [A, B, C], predict on D -> D'
Meta Ensemble (Fold D)
- fit LogReg on [A', B', C'], predict on D'
Repeat with A, B, and C each taking the role of the held-out fold
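And here is roughly how I would code Method 2 (again just a sketch under the same assumptions; the helper name `_base_probas` is mine):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def _base_probas(X_train, y_train, X_test):
    """Fit both base models and return their class-1 probabilities on X_test."""
    knn = KNeighborsClassifier().fit(X_train, y_train)
    svm = SVC(probability=True).fit(X_train, y_train)
    return np.column_stack([knn.predict_proba(X_test)[:, 1],
                            svm.predict_proba(X_test)[:, 1]])

def method2_meta_predictions(X, y, n_outer=4, seed=0):
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    meta_preds = np.zeros(len(X))

    for inner_idx, d_idx in outer.split(X):  # D = the held-out outer fold
        X_in, y_in = X[inner_idx], y[inner_idx]

        # Inner rotation over A, B, C produces out-of-fold features A', B', C'.
        inner = KFold(n_splits=n_outer - 1, shuffle=True, random_state=seed)
        inner_meta = np.zeros((len(inner_idx), 2))
        for tr, te in inner.split(X_in):
            inner_meta[te] = _base_probas(X_in[tr], y_in[tr], X_in[te])

        # D' comes from base models refit on all of A, B, C.
        d_meta = _base_probas(X_in, y_in, X[d_idx])

        # Meta model is fit only on A', B', C' and evaluated on D'.
        logreg = LogisticRegression().fit(inner_meta, y_in)
        meta_preds[d_idx] = logreg.predict_proba(d_meta)[:, 1]

    return meta_preds
```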
I think Method 1 introduces a leak because, for example, when the meta model predicts on C' it was fit on [A', B'], and A' was produced by first-level models trained on [B, C], so the meta model's training features already encode C's target values. Method 2, on the other hand, seems to prevent that leakage, but it's more complex and it reduces the amount of data available for fitting the first-level models. How are people stacking in practice?