2

I am dealing with a task where I train a classification model to predict whether an item is going to be returned in a web shop.

Can I use features which contain information from the target variable?

For example, can I create a feature with the relative frequency of how often an item with a specific ID has been returned in the past?

My professor said that this counts as data leakage and should be avoided (if I remember correctly), but I don't see why it would have a negative impact on the prediction process.

00schneider
end9

1 Answer

2

You can, as long as such features are not computed using the value you're trying to predict. In time series, for example, to predict today's value (day $N$) it is perfectly acceptable to have a feature containing the average value over the previous 10 days (i.e. from day $N-10$ to day $N-1$).

By itself, this does not constitute target leakage, because it is information you would already have available on the previous day.
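
As a minimal sketch of the 10-day example above, assuming the daily values sit in a pandas `Series` ordered by day (the function name `lagged_mean` and the toy data are hypothetical, chosen only for illustration): shifting by one day before taking the rolling mean keeps day $N$'s own value out of its feature.

```python
import pandas as pd

def lagged_mean(values: pd.Series, window: int = 10) -> pd.Series:
    """Mean over the `window` days strictly before each day."""
    # shift(1) moves every value one day forward, so the window that ends
    # at day N covers days N-window .. N-1 and never sees day N itself.
    return values.shift(1).rolling(window=window, min_periods=window).mean()

# Toy data (hypothetical): the values 1..15 for days 0..14.
daily = pd.Series(range(1, 16), dtype=float)
feature = lagged_mean(daily, window=10)
```

With this toy series the first 10 days have no feature value (there is not enough history yet), and the feature for day 10 is the mean of days 0 to 9.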

Davide ND
  • Is it a problem if I compute, say, the average return rate for an item and include the target variable of the occurrence itself in the computation of that average? My training set has approximately 350,000 occurrences and about 3,000 unique item IDs, so it won't make much of a difference in the vast majority of cases. – end9 Feb 21 '20 at 14:03
  • The thing is, I basically use data from the training set as features of the test set. While it does increase the performance on the test set by a good amount, I am confused as to whether this is allowed. – end9 Feb 21 '20 at 14:19
  • 2
    It is alright to create features with the training set and use them for predictions on independent data. The key point is that for feature engineering, the test data must not be used. – 00schneider Feb 21 '20 at 14:45
  • 1
    You can compute the average return rate, but only from the target values in your training data. The feature for your test data STILL has to be computed from the training targets, as your test targets cannot be used. – Davide ND Feb 21 '20 at 14:47
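
The advice in the comments above could be sketched roughly as follows. This is only an illustration under assumptions: the column names `item_id` and `returned`, the helper names, and the toy data are all hypothetical. The per-item return rate is fitted on the training targets only and then mapped onto the test rows. The leave-one-out variant for training rows is one common way to keep a row's own target out of its own feature, which is the concern raised in the first comment; it is not something the commenters proposed.

```python
import pandas as pd

def fit_return_rates(train: pd.DataFrame):
    """Per-item return rate and overall rate, from training targets only."""
    rates = train.groupby("item_id")["returned"].mean()
    global_rate = train["returned"].mean()
    return rates, global_rate

def add_return_rate(df: pd.DataFrame, rates: pd.Series, global_rate: float) -> pd.DataFrame:
    """Attach the training-derived rate; unseen items get the overall rate."""
    out = df.copy()
    out["item_return_rate"] = out["item_id"].map(rates).fillna(global_rate)
    return out

def add_loo_return_rate(train: pd.DataFrame, global_rate: float) -> pd.DataFrame:
    """Leave-one-out rate for training rows: exclude each row's own target."""
    grp = train.groupby("item_id")["returned"]
    total = grp.transform("sum")
    count = grp.transform("count")
    loo = (total - train["returned"]) / (count - 1)
    out = train.copy()
    # Items seen only once have no other rows, so fall back to the overall rate.
    out["item_return_rate"] = loo.where(count > 1, global_rate)
    return out

# Toy data (hypothetical). The test frame carries no target column at all.
train = pd.DataFrame({
    "item_id": ["A", "A", "A", "B", "B", "C"],
    "returned": [1, 0, 1, 0, 0, 1],
})
test = pd.DataFrame({"item_id": ["A", "B", "D"]})

rates, global_rate = fit_return_rates(train)
train_feat = add_loo_return_rate(train, global_rate)
test_feat = add_return_rate(test, rates, global_rate)
```

Because `add_return_rate` never reads a `returned` column from the frame it is given, the test targets cannot leak into the feature even by accident.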