How to (cross) validate a feature built with use of predicted variable

Question

Let's say my dataset has features $X_1$, $X_2$, $X_3$ and a predicted variable $Y$. While doing some feature engineering, I came up with a feature $X_4$ that is the mean of $Y$ over samples similar to the one we're looking at.

Before coming up with feature $X_4$, I did regular 10-fold CV. What is the correct way of including $X_4$ in the validation? Am I correct to assume that $X_4$ for the test set in each fold should only be based on samples from this test set?

I was looking into "impact-coding" which looks similar to what I'm doing, but was unable to find good info on cross-validating it.

score 1 · Accepted Answer · edited Apr 25 '18 at 05:07


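The correct approach is to allow only in-sample observations, that is, the training portion of each fold, to calculate the value of this feature. The $Y$ values of the held-out samples must not enter the calculation, so you have to recalculate this feature in each of the 10 folds you test.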

Note that this is equivalent to applying a k-nearest-neighbor regressor and using its output as a predictor for a second regressor. In other words, you are "stacking" a kNN regressor with whatever regressor you apply after adding this feature. This is a well-known technique.
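As a rough illustration of recomputing the feature inside every fold, here is a minimal NumPy sketch. Several things in it are assumptions and not part of the question: "similar" is taken to mean the k nearest rows in Euclidean distance, the downstream model is ordinary least squares, the data are synthetic, and the names `knn_target_mean` and `cv_rmse_with_x4` are made up for this example. Excluding a training row's own target from its neighbourhood is also a choice added here, so that $X_4$ on the training rows is not built from the very value being predicted.

```python
import numpy as np


def knn_target_mean(X_ref, y_ref, X_query, k, exclude_self=False):
    """Mean of y_ref over the k nearest reference rows of each query row."""
    # Pairwise Euclidean distances, shape (n_query, n_ref).
    dist = np.linalg.norm(X_query[:, None, :] - X_ref[None, :, :], axis=2)
    if exclude_self:
        # Only valid when X_query is X_ref: a row must not see its own target.
        np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_ref[nearest].mean(axis=1)


def cv_rmse_with_x4(X, y, n_folds=10, k=5, seed=0):
    """k-fold CV in which X4 is rebuilt from the training rows of each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]

        # X4 for training rows: neighbours among the *other* training rows.
        x4_tr = knn_target_mean(X_tr, y_tr, X_tr, k, exclude_self=True)
        # X4 for test rows: training targets only; y_te is never used here.
        x4_te = knn_target_mean(X_tr, y_tr, X_te, k)

        # Downstream model (assumed): least squares with an intercept.
        A_tr = np.column_stack([np.ones(len(y_tr)), X_tr, x4_tr])
        A_te = np.column_stack([np.ones(len(y_te)), X_te, x4_te])
        coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)

        scores.append(np.sqrt(np.mean((A_te @ coef - y_te) ** 2)))
    return np.array(scores)


# Synthetic data, purely for illustration.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

scores = cv_rmse_with_x4(X, y)
print("mean RMSE over folds:", scores.mean())
```

The point of the sketch is only the data flow: within a fold, both `x4_tr` and `x4_te` are derived from `y_tr`, and the held-out targets are used solely for scoring.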

See this thread to find out more about cross-validating stacked models: Proper cross validation for stacking models

edited Apr 25 '18 at 05:07 by Jacek Chmielewski

answered Apr 24 '18 at 10:09 by Gijs

But every aspect of 'feature engineering' must be included for re-execution in the cross-validation process. – Frank Harrell Apr 24 '18 at 12:18
