Data leakage if I add prediction as feature?

Question

I have a training set and a test set.
Let's assume the following:

1. I train a random forest on the training set.
2. I make predictions on the training set and the test set.
3. I add those predictions as features back into the training set and the test set.
4. I train another random forest on the training set from step 3.
5. Finally, I make predictions on the test set from step 3 with the model from step 4.

Would it count as data leakage if I use the predictions from step 2 as a feature for the model in steps 4-5?

gunes · Accepted Answer · 2019-11-22T17:26:26.683


No, it's not. It's a special case of stacking, which is widely used in practice, especially with different methods applied one after another.

In typical stacking you use several models at the first level and a meta-model at the second level on top of them. The meta-model takes the predictions of the first-level models as inputs and learns from them. You have just one model at your first level, and a combiner at the second level that also receives additional features (i.e. the base features again). This is similar (though not identical) to adding layers to a neural network.
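A minimal sketch of the two-level setup described above, assuming scikit-learn and a synthetic dataset in place of the asker's data (the dataset, hyperparameters, and variable names are all illustrative). One deliberate substitution: the training-set predictions are produced out-of-fold with `cross_val_predict`, a common stacking practice, rather than in-sample as in the question. The test labels are used only for the final score, never for fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic data standing in for the asker's training and test sets (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Level 1: the first random forest.
rf1 = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-fold class-1 probabilities for the training rows: each row is
# predicted by a forest that did not see it during fitting.
oof_train = cross_val_predict(
    rf1, X_train, y_train, cv=5, method="predict_proba"
)[:, 1]

# Refit on the full training set, then predict the test rows.
rf1.fit(X_train, y_train)
pred_test = rf1.predict_proba(X_test)[:, 1]

# Append the level-1 prediction to the base features as one extra column.
X_train_aug = np.column_stack([X_train, oof_train])
X_test_aug = np.column_stack([X_test, pred_test])

# Level 2: a second random forest on base features plus the prediction.
rf2 = RandomForestClassifier(n_estimators=100, random_state=1)
rf2.fit(X_train_aug, y_train)

# y_test is touched only here, for evaluation.
test_accuracy = rf2.score(X_test_aug, y_test)
print(f"second-level test accuracy: {test_accuracy:.3f}")
```

Swapping `cross_val_predict` for `rf1.fit(X_train, y_train).predict_proba(X_train)[:, 1]` would reproduce the question's steps literally; in neither variant does any information from the test labels reach a fitted model.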

I haven't encountered your specific case (i.e. RF after RF), but I recall several other problems where a typical baseline method was fit first and then another model was fit on its predictions/residuals, using extra features, to boost it.
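The baseline-then-residuals pattern mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: both models are one-variable least-squares fits standing in for whatever baseline and second-stage methods are actually used, the data is synthetic, and `fit_line` and `mse` are hypothetical helpers.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(targets, preds):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(targets, preds)) / len(targets)


# Synthetic training data (assumption): y depends on two features.
rng = random.Random(0)
x1 = [rng.uniform(0, 1) for _ in range(200)]
x2 = [rng.uniform(0, 1) for _ in range(200)]
y = [2.0 * a + 3.0 * b + rng.gauss(0, 0.1) for a, b in zip(x1, x2)]

# Stage 1: baseline model that sees only the first feature.
a0, b0 = fit_line(x1, y)
base_pred = [a0 + b0 * v for v in x1]

# Stage 2: fit a second model on the baseline's residuals,
# using the extra feature the baseline never saw.
residuals = [t - p for t, p in zip(y, base_pred)]
a1, b1 = fit_line(x2, residuals)

# Final prediction = baseline prediction + predicted residual.
final_pred = [p + a1 + b1 * v for p, v in zip(base_pred, x2)]

mse_base = mse(y, base_pred)
mse_final = mse(y, final_pred)
print(f"baseline MSE: {mse_base:.4f}, after residual fit: {mse_final:.4f}")
```

At prediction time the same two fitted models are applied in sequence to new rows, so nothing about the new rows' targets is needed.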

TL;DR This is not data leakage.

edited Nov 22 '19 at 17:26

answered Nov 22 '19 at 14:51

gunes


Do you have any reference to this being a special case of stacking? I would like to read something about this or see the results of this being used. – Viðar Ingason Nov 22 '19 at 16:15
I've added some more comments. – gunes Nov 22 '19 at 17:26
