
I have a training set and a test set.
Let's assume the following:

  1. I train a random forest on the training set.
  2. I make predictions on the training set and the test set.
  3. I add those predictions as a feature back into the training set and the test set.
  4. I train another random forest on the augmented training set from step 3.
  5. Finally, I make predictions on the augmented test set from step 3 with the model from step 4.
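
The five steps above can be sketched as follows. This is only a minimal illustration, assuming scikit-learn, a binary classification task, and synthetic data from `make_classification` in place of the real training and test sets; the variable names (`rf1`, `rf2`, `X_train_aug`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for the real training and test sets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 1: train the first random forest on the training set only.
rf1 = RandomForestClassifier(n_estimators=50, random_state=0)
rf1.fit(X_train, y_train)

# Step 2: predict on both sets (class-1 probability is used as the prediction).
p_train = rf1.predict_proba(X_train)[:, 1]
p_test = rf1.predict_proba(X_test)[:, 1]

# Step 3: append the predictions as one extra feature column.
X_train_aug = np.column_stack([X_train, p_train])
X_test_aug = np.column_stack([X_test, p_test])

# Step 4: train the second random forest on the augmented training set.
rf2 = RandomForestClassifier(n_estimators=50, random_state=1)
rf2.fit(X_train_aug, y_train)

# Step 5: predict on the augmented test set.
test_pred = rf2.predict(X_test_aug)
```

Note that the test labels `y_test` are never used in any `fit` call here.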

Would it count as data leakage if I use the predictions from step 2 as a feature for the model in steps 4-5?

Viðar Ingason

1 Answer


No, it's not. It's a special case of stacking, which is widely used in practice, especially with different methods applied one after another.

In typical stacking, you use several first-level models and a second-level meta model on top of them. The meta model takes the predictions of the first-level models as inputs and learns from them. You have just one model in your first level, and a combiner in the second level that also receives additional features (i.e. the base features again). This is similar to (but not the same as) adding layers to a neural network.
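
For comparison, here is a rough sketch of the same two-level layout using scikit-learn's `StackingClassifier`, assuming a binary classification task and synthetic data. Setting `passthrough=True` passes the base features to the second level alongside the first-level predictions. One difference from the setup in the question: `StackingClassifier` trains the final estimator on cross-validated predictions of the first-level model rather than on in-sample predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for the real training and test sets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

stack = StackingClassifier(
    # First level: a single random forest, as in the question.
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    # Second level: another random forest acting as the combiner.
    final_estimator=RandomForestClassifier(n_estimators=50, random_state=1),
    # Feed the base features to the second level as well.
    passthrough=True,
    # First-level predictions used for training come from 5-fold CV.
    cv=5,
)
stack.fit(X_train, y_train)

# Second-level inputs: first-level prediction column(s) plus base features.
second_level_inputs = stack.transform(X_test)
stacked_pred = stack.predict(X_test)
```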

I haven't encountered your specific case (i.e. RF after RF), but I recall several other problems where a typical baseline method is fit first and another model is then fit on its predictions or residuals, using extra features, to boost performance.
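
A minimal sketch of the baseline-then-residuals pattern, purely as an illustration: it assumes a regression task, synthetic data, a linear baseline that sees only the first five features, and a random forest fit on the baseline's residuals using all features. None of these choices come from a specific problem mentioned above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumption: synthetic regression data; columns 5-9 play the "extra features".
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Baseline model fit on a subset of the features.
baseline = LinearRegression().fit(X_train[:, :5], y_train)

# Residuals of the baseline on the training set.
residuals = y_train - baseline.predict(X_train[:, :5])

# Second model learns the residuals from all features, extras included.
residual_model = RandomForestRegressor(n_estimators=50, random_state=0)
residual_model.fit(X_train, residuals)

# Final prediction: baseline prediction plus predicted residual.
final_pred = baseline.predict(X_test[:, :5]) + residual_model.predict(X_test)
```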

TL;DR This is not data leakage.

gunes