
I have a dataset that contains quantity ordered (along with other variables such as product type, product price, customer group, etc.). The target variable is whether the customer churned or not. I am using supervised binning to convert this continuous variable into categorical values like high, medium, and low based on the quantity-ordered level.

However, my question is not based on the dataset itself but on the technique called supervised binning.

Doesn't supervised binning qualify as data leakage? We create bins based on the target variable (using the training data only), and later we feed that information (bins derived from the target column) back into the model as an input.

Can you share some insights on whether it is recommended to do this?

If yes, why so?

If not, why not? I ask because I see a lot of tutorials and posts on doing supervised binning for discretization of continuous variables (during data preparation). Should I only use unsupervised binning?
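
For reference, here is roughly what I mean by supervised binning: a minimal sketch (made-up column names and toy data, assuming scikit-learn), in which a shallow decision tree picks cut points for the quantity feature using the training labels, and the resulting bins become a categorical input feature.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for my real data (made-up values).
rng = np.random.default_rng(0)
df = pd.DataFrame({"qty_ordered": rng.integers(1, 10_000, size=1_000)})
df["churned"] = (
    df["qty_ordered"].between(3_400, 6_500) & (rng.random(1_000) < 0.7)
).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    df[["qty_ordered"]], df["churned"], test_size=0.2, random_state=0
)

# Supervised binning: a shallow tree chooses cut points using the
# *training* labels only.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)
thresholds = tree.tree_.threshold
cut_points = np.sort(thresholds[thresholds != -2])  # -2 marks leaf nodes

# The learned cut points are then applied to train and test alike,
# producing the high/med/low-style categorical feature.
train_bins = np.digitize(X_train["qty_ordered"], cut_points)
test_bins = np.digitize(X_test["qty_ordered"], cut_points)
```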

The Great
  • Why do you want to discretize your data in the first place? – Stephan Kolassa Feb 10 '22 at 07:24
  • I am trying to bin to gain some insights. Meaning, if I know that orders with qty values between 3400 and 6500 lead to churn more often, then that's useful. If I have them as continuous, it may not give such insights (rather than letting me know that Qty is an important variable). Instead of me manually creating bins, I want the system to find out cut points automatically – The Great Feb 10 '22 at 07:30
  • I also felt discretization helps handle outliers (putting them in one bucket) – The Great Feb 10 '22 at 07:32
  • Discretization loses a lot of information, https://stats.stackexchange.com/q/68834/1352, so I would argue it will rather *reduce* insight than create it. It will probably be much better to use your quantity as a numerical variable, possibly spline transform it, and derive your insights from plots of model fits or predictions. Also, yes, binning will leak data, unless you bin based on the training data only. – Stephan Kolassa Feb 10 '22 at 07:37
  • Is spline transform better over quartile/decile based discretization? Would you suggest unsupervised discretization? – The Great Feb 10 '22 at 07:50
  • I would almost always prefer spline transforms over any kind of discretization. Discretization means that all items in one bin are treated the same, and items in different bins are treated as completely unrelated - but items in bins 1 and 2 are closer together than items in bin 1 and 9, and discretization completely loses this information. [Here is an illustration in the context of binning time series data into hours.](https://stats.stackexchange.com/a/478175/1352) A spline transform yields much more reasonable fits, and expends far fewer degrees of freedom. – Stephan Kolassa Feb 10 '22 at 07:55
  • And of course, choose spline knots based on the training data only, to avoid data leakage. – Stephan Kolassa Feb 10 '22 at 07:55 (a sketch of this approach follows the comment thread)
  • My question on data leakage was not based on test info slipping into train, but the target info (from the train set) being used to create the bins (created as an input variable during training). Doesn't this qualify as data leakage as well? This sort of leakage can lead to overfitting, no? – The Great Feb 10 '22 at 08:02
  • In that case, whether I use `spline`, `Decision Trees`, etc., on the train data only, wouldn't they qualify as data leakage? – The Great Feb 10 '22 at 08:04
  • Data leakage happens when test data you should not be using during training slips in. If you bin your predictors based on the outcome, all *on training data only*, that is not data leakage - that is simply a transformation of the predictors, a straightforward part of training. (And I still think discretization is not a good idea.) – Stephan Kolassa Feb 10 '22 at 08:06
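
A minimal sketch of the spline-transform alternative suggested in the comments above (toy data, assuming scikit-learn's `SplineTransformer`), with the knots chosen from the training data only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

# Toy data: churn probability varies smoothly with quantity (made up).
rng = np.random.default_rng(1)
qty = rng.integers(1, 10_000, size=(1_000, 1)).astype(float)
churn = (rng.random(1_000) < 1 / (1 + np.exp(-(qty[:, 0] - 5_000) / 1_000))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(qty, churn, random_state=0)

# The spline knots are placed at quantiles of the *training* data when the
# pipeline is fit, so the test set never influences the transform.
model = make_pipeline(
    SplineTransformer(n_knots=5, degree=3, knots="quantile"),
    LogisticRegression(max_iter=1_000),
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```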

2 Answers


As already noted in the comments and in the other answer, you need to train the binning algorithm on the training data only; in that case it has no chance of leaking the test data, because it has never seen it.

But you seem to be concerned with the fact that the binning algorithm uses the labels, so it "leaks" the labels into the features. This concern makes sense; after all, if you had a model like

$$ y = f(y) $$

it would be quite useless. It would predict nothing and it would be unusable at prediction time, when you have no access to the labels. But it is not that bad.

First, notice that any machine learning algorithm has access to both the labels and the features during training, so if you were not allowed to look at the labels while training, you could not train at all. A good example is the naive Bayes algorithm, which groups the data by the labels $Y$, calculates the empirical probabilities for the labels $p(Y=c)$ and the empirical probabilities for the features given (grouped by) each label $p(X_i \mid Y=c)$, and combines those using Bayes' theorem:

$$ p(Y=c \mid X_1, \dots, X_n) \propto p(Y=c) \prod_{i=1}^n p(X_i \mid Y=c) $$

If you think about it, this is almost a generalization of the binning idea to smooth categories: in binning we transform $X_i \mid Y=c$ into discrete bins, while naive Bayes replaces it with a probability (a continuous score!). Of course, the difference is that with binning you then use the features as input for another model, but basically supervised binning is a kind of poor man's naive Bayes algorithm.
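
As a rough illustration of this analogy (toy data and made-up bin labels, assuming pandas), the empirical probabilities above can be read straight off binned training data:

```python
import pandas as pd

# Toy binned training data: one discretized feature plus the label.
train = pd.DataFrame({
    "qty_bin": ["low", "low", "med", "high", "high", "med", "low", "high"],
    "churned": [0, 0, 0, 1, 1, 0, 0, 1],
})

# Empirical class priors p(Y = c).
prior = train["churned"].value_counts(normalize=True)

# Empirical conditionals p(X = bin | Y = c): the feature grouped by the
# label, which is exactly what naive Bayes does.
conditional = train.groupby("churned")["qty_bin"].value_counts(normalize=True)

# Unnormalized naive Bayes score p(Y = c) * p(X = "high" | Y = c).
for c in prior.index:
    score = prior[c] * conditional.get((c, "high"), 0.0)
    print(c, score)
```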

Finally, as noted by Stephan Kolassa in the comments, binning is usually discouraged. It results in losing information, so you end up with lower-quality features to train on compared to the raw data. Ask yourself whether you really need to bin the data in the first place.

Tim

If you only use training data for supervised binning, you cannot leak information from the test dataset, simply because you are not using it. So, no, when done right, there is no leakage.
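
For example, here is a minimal sketch of supervised binning done right (toy data, assuming scikit-learn; `TreeBinner` is just an illustrative helper, not a library class): the binner is refit on the training split of every cross-validation fold, so the held-out data never influences the bins.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier


class TreeBinner(BaseEstimator, TransformerMixin):
    """Toy supervised binner: learns cut points from X and y during fit."""

    def __init__(self, max_depth=2):
        self.max_depth = max_depth

    def fit(self, X, y):
        tree = DecisionTreeClassifier(max_depth=self.max_depth, random_state=0)
        tree.fit(X, y)
        thresholds = tree.tree_.threshold
        self.cut_points_ = np.sort(thresholds[thresholds != -2])  # -2 marks leaves
        return self

    def transform(self, X):
        # Apply the cut points learned from the training split only.
        return np.digitize(np.asarray(X), self.cut_points_)


# Made-up data: churn is driven by a threshold on quantity.
rng = np.random.default_rng(0)
X = rng.integers(1, 10_000, size=(500, 1)).astype(float)
y = (X[:, 0] > 5_000).astype(int)

# Inside cross-validation the binner is refit on each training split, so the
# held-out split never influences the bins: no leakage.
pipe = make_pipeline(TreeBinner(max_depth=2), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5))
```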

frank
  • Sorry, can I check: how can supervised binning be done without target/label data? – The Great Feb 10 '22 at 07:50
  • You mean, I can use the train data and train labels to create bins? Is that what you are suggesting? But I should not use the full dataset to do the binning? btw, upvoted for the help – The Great Feb 10 '22 at 07:51
  • Yes, that's what I mean. That's what you always do in supervised learning: you use both the input *and* target data for training. Leakage concerns whether you somehow sneak information from the test dataset into your training algorithms, so that the algorithm already knows about the test dataset and thus performs better. – frank Feb 10 '22 at 07:55
  • But my question on data leakage was not based on test info slipping into train, but the target info (from train) used as bin info (created as an input variable during training). Doesn't this qualify as data leakage as well? This sort of leakage can lead to overfitting, no? – The Great Feb 10 '22 at 08:01
  • Using information from the target part of your training data to bin the input part of your training data does not qualify as leakage. You are free to use any part of your training data in any way you wish. Whether some ways lead to more overfitting than others is a different matter; it doesn't have anything to do with leakage. – frank Feb 10 '22 at 08:20