
I was redirected from StackOverflow because my question is more about theory.

I have a usual set-up: a pandas dataframe with some features and a numeric target variable (financial returns, for example). Now I want to turn this into a classification problem: rather than predicting the numerical value of the return, I want to predict classes. My question is about how to correctly create the classes of the target variable when the class boundaries depend on the data. For example, I want to create 4 classes (1, ..., 4) for the target variable based on the quartiles of the target variable. But my belief is that, with the full data set in hand, I cannot calculate the quartiles on the whole target variable and then do a train/test split afterwards and run a CV on the train set, because the quartile values used to create the classes would then be based on the test data as well.

So my question is: how can I approach such a task in an sklearn framework? I saw that there is the class TransformedTargetRegressor, which goes in this direction: one could possibly use it together with KBinsDiscretizer to transform the target variable. But a problem I see there is that it always back-transforms the classes into numerical values when calling .predict etc., whereas I want to do classification, not predict numerical values.
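To make this concrete, here is a minimal sketch of the naive version I would like to avoid (the column name `ret` and the simulated data are just placeholders for my real setup):

```
import numpy as np
import pandas as pd

# Placeholder data: a single "returns" column standing in for my real target.
rng = np.random.default_rng(0)
df = pd.DataFrame({"ret": rng.normal(loc=0.0, scale=0.02, size=1000)})

# Naive approach: quartile classes computed on the FULL dataset before any
# train/test split -- exactly the leakage I am worried about.
df["ret_class"] = pd.qcut(df["ret"], q=4, labels=[1, 2, 3, 4])
```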

Or: would it even be allowed to estimate the quartiles on the whole dataset and then classify all target observations based on them? But then I would have a data leakage problem, right?

Happy for any help.

alphaH
  • Please add a link to your previous question on StackOverflow. – Apprentice Sep 11 '21 at 14:19
  • This was the link: https://stackoverflow.com/questions/69142504/transform-target-label-variable-into-classes-but-classes-are-data-dependent-how – alphaH Sep 11 '21 at 14:24
  • I think you should do it in the same way you would use any other preprocessing transformer (e.g. StandardScaler). 1. Calculate the quartiles using the training set, and 2. Transform your target variable from continuous to categorical on both the training and test sets using the quartiles found in the previous step. – Adrià Luz Sep 11 '21 at 14:37
  • [Discretizing a continuous variable throws away a lot of information](https://stats.stackexchange.com/q/68834/1352), and I have never seen a case where it was beneficial. Why do you want to do this? – Stephan Kolassa Sep 11 '21 at 14:39
  • @Adrià: Thanks, I think that is correct. But say I want to do a cross-validation that is implemented directly in sklearn: for preprocessing features I create a pipeline and then feed that pipeline into cross_validate() or GridSearchCV(). Is there not also a way to use those routines here? Otherwise it gets very difficult. – alphaH Sep 11 '21 at 14:55
  • @Stephan: It seems to be common, see e.g. the article "Forecasting multinomial stock returns using machine learning methods". – alphaH Sep 11 '21 at 14:55
  • @alphaH If there aren't any off-the-shelf transformers in scikit-learn that do exactly what you want, you can always use a [custom transformer](https://www.section.io/engineering-education/custom-transformer/). – Adrià Luz Sep 11 '21 at 15:04
  • @Adrià: Yes, that's true, and KBinsDiscretizer does exactly this. But a Pipeline only transforms the features, never the labels, while using a pipeline is important for cross-validation etc. So I need to know how to implement this for the target rather than for the features. – alphaH Sep 11 '21 at 15:10
  • @AdriàLuz Your comment "I think you should do it in the same way you would use any other preprocessing transformer..." seems like a perfectly good start for an answer. Remember, stats.SE isn't a site where asking for code is on-topic, so you only need to answer the statistical portion of this question ("How to approach this correctly?") – Sycorax Sep 11 '21 at 17:34
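To make the cross-validation concern in the comments concrete, one possible direction is a small custom estimator that bins the target inside fit, so the quartiles are re-estimated on each training fold. This is a rough, untested sketch; the class name QuartileTargetClassifier and the RandomForestClassifier default are my own choices, not anything built into scikit-learn:

```
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.ensemble import RandomForestClassifier


class QuartileTargetClassifier(BaseEstimator, ClassifierMixin):
    """Bins a continuous target into quartile classes inside fit().

    Because the bin edges come only from the y seen by fit(), they are
    re-estimated on each training fold during cross-validation, so the
    test fold never influences the class boundaries.
    """

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y):
        # Quartile edges from the training target only.
        self.bin_edges_ = np.quantile(y, [0.25, 0.5, 0.75])
        y_class = np.digitize(y, self.bin_edges_) + 1  # classes 1..4
        base = self.estimator if self.estimator is not None else RandomForestClassifier()
        self.estimator_ = clone(base).fit(X, y_class)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def score(self, X, y):
        # y is still continuous here: bin it with the training-fold edges,
        # then report plain classification accuracy.
        y_class = np.digitize(y, self.bin_edges_) + 1
        return float(np.mean(self.predict(X) == y_class))
```

With something like this, cross_validate(QuartileTargetClassifier(), X, y) would recompute the quartile edges on every training fold and score accuracy through the wrapper's own score method.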

1 Answer


As per @Sycorax's suggestion, I'm expanding my first comment as an answer...


I think you should do it in the same way you would use any other preprocessing transformer. That is:

  1. Calculate the quartiles using the training set
  2. Transform your target variable from continuous to categorical on both the training and test sets using the quartiles found in the previous step (sketched in code right after this list)
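In code, the two steps could look roughly like this (a sketch; `X` and `y` stand for your features and continuous target, which are assumptions about your setup):

```
import numpy as np
from sklearn.model_selection import train_test_split

# Split first, so the test set plays no role in defining the class boundaries.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 1: quartiles from the training target only.
q1, q2, q3 = np.quantile(y_train, [0.25, 0.5, 0.75])

# Step 2: same cut points applied to both sets, giving classes 1..4.
y_train_class = np.digitize(y_train, [q1, q2, q3]) + 1
y_test_class = np.digitize(y_test, [q1, q2, q3]) + 1
```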

I'll illustrate this with an example. For simplicity, let's imagine you only had 10 observations. This is what your target variable might look like:

[Table: full dataset]

Next, you randomly split your dataset into train (70%) and test (30%).

Train: [table: training set]

Test: [table: test set]

(I know this all looks a bit ridiculous with such a small number of observations, but the main idea is the important bit).

Now, you calculate the quartiles from the training set. These are: $$q_1=-0.040 \\ q_2=0.100 \\ q_3=0.145$$ Using this information, you now proceed to transform your target variable on both the training and test sets using the following logic: $$Y^* = \begin{cases} 1 & \text{if } Y\leq q_1\\ 2 & \text{if } q_1 < Y\leq q_2\\ 3 & \text{if } q_2 < Y\leq q_3\\ 4 & \text{if } Y > q_3 \end{cases}$$
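As a quick numeric check of that mapping (note that np.digitize puts exact ties on the other side of the boundary than the ≤ above, which only matters if a value equals a quartile exactly; the four example returns below are made up):

```
import numpy as np

q = [-0.040, 0.100, 0.145]                       # q1, q2, q3 from the training set
y_example = np.array([-0.12, 0.05, 0.12, 0.20])  # made-up return values
print(np.digitize(y_example, q) + 1)             # -> [1 2 3 4]
```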

Train: [table: transformed training set]

Test: [table: transformed test set]

Now you can train a model using $Y^*$ as your target variable.
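Any scikit-learn classifier can then be fitted on the discretized target; for instance (a sketch, reusing the `X_train`/`X_test` split and the transformed targets from the earlier sketch):

```
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train_class)       # classes 1..4 as the target
test_pred = clf.predict(X_test)       # predicted classes -- no back-transform to returns
print(accuracy_score(y_test_class, test_pred))
```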

Adrià Luz
  • I'm not sure why the tables are not showing properly (they looked fine in the preview as I was writing the answer!) – Adrià Luz Sep 11 '21 at 18:15
  • Thank you for taking the time. Makes sense. I guess for an answer about the sklearn implementation I need to switch forums. – alphaH Sep 11 '21 at 18:26
  • @AdriàLuz (+1) Thanks for writing this up! I was also puzzled by the problem with tables -- they're fine in the edit view, but broken when you display the page. It must be some bug. – Sycorax Sep 11 '21 at 18:28
  • Oh I see! I will replace them with pictures then. – Adrià Luz Sep 11 '21 at 18:30