
Is it reasonable to transform a regression problem into a classification problem by binning the target variable into classes and constructing a regression curve separately on each class?

Precisely, if my goal is to solve a regression problem, are the following steps reasonable:

  1. If my target variable is $Y$, create $m$ classes $Y_1 =\{Y:Y<y_1\}, Y_2=\{Y:y_1\leq Y<y_2\},\dots,Y_m=\{Y:Y\geq y_{m-1}\}$.
  2. Construct a classifier $p_j(x)=Pr(Y \in Y_j|X=x)$.
  3. Construct a regression curve for each class separately, $E[Y|Y_j,X=x]=f_j(x)$.
  4. Estimate the final regression curve by $E[Y|X=x]=\sum_{j=1}^m p_j(x)f_j(x).$

Theoretically, if our goal is to construct the regression curve $E[Y|X=x]$, then from the identity $$ E[Y|X=x]= \sum_{j=1}^m Pr(Y \in Y_j|X=x) E[Y|Y_j,X=x]=\sum_{j=1}^m p_j(x)f_j(x)$$ it looks like the steps described above are just a waste of time. But it's possible that the terms $p_j(x)f_j(x)$ could be estimated more effectively than $E[Y|X=x]$ directly. Any literature or comment would be helpful.
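The identity can also be checked numerically. The sketch below is only an illustration under stated assumptions: a discrete predictor, a simulated data-generating process, and hypothetical cut points `cuts`. It suggests that with plain plug-in (empirical) estimates of $p_j(x)$ and $f_j(x)$, the recombined estimate is just the per-$x$ sample mean again, so any gain would have to come from modelling $p_j$ and $f_j$ with smoother estimators.

```python
import random
from bisect import bisect_right
from collections import defaultdict

random.seed(0)

# Simulated data (assumption): discrete X in {0, 1, 2}, Y = 2*X + Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0.0, 1.0))
        for x in (random.randrange(3) for _ in range(3000))]

cuts = [1.0, 3.0]  # hypothetical cut points y_1 < y_2, giving m = 3 classes


def bin_index(y):
    """Index of the class that y falls into (intervals closed on the left)."""
    return bisect_right(cuts, y)


def direct_mean(x0):
    """Plug-in estimate of E[Y | X = x0]: the sample mean of Y at x0."""
    ys = [y for x, y in data if x == x0]
    return sum(ys) / len(ys)


def recombined_mean(x0):
    """Plug-in estimate of sum_j p_j(x0) * f_j(x0)."""
    by_bin = defaultdict(list)
    for x, y in data:
        if x == x0:
            by_bin[bin_index(y)].append(y)
    n = sum(len(ys) for ys in by_bin.values())
    total = 0.0
    for ys in by_bin.values():
        p_j = len(ys) / n        # estimate of Pr(Y in Y_j | X = x0)
        f_j = sum(ys) / len(ys)  # estimate of E[Y | Y in Y_j, X = x0]
        total += p_j * f_j
    return total


# The two columns agree up to floating-point rounding.
for x0 in range(3):
    print(x0, direct_mean(x0), recombined_mean(x0))
```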

kjetil b halvorsen
D.G
  • 1
    Why do you want this? Binning is losing information: [Binning is losing](https://stats.stackexchange.com/questions/104402/what-is-the-justification-for-unsupervised-discretization-of-continuous-variable). – kjetil b halvorsen Aug 08 '19 at 09:31
  • 2
    I can't think of a statistical principle that would make one want to do this. – Frank Harrell Aug 08 '19 at 11:18
  • @kjetilbhalvorsen Binning predictors is not a good idea; I'm interested in binning the target variable. – D.G Aug 08 '19 at 14:18
  • 3
    Binning the target variable is not a good idea either! Better to tell us what your ultimate modeling goal is. – kjetil b halvorsen Aug 08 '19 at 14:31

1 Answer


Short answer: It is most likely not reasonable.

While the question lacks details of the actual goal of the analysis, my best guess is that you assume your response to be nonlinear, since the procedure you describe roughly corresponds to replacing a linear response with a piecewise linear response.

If you indeed need to do this, then you could benefit from some non-parametric regression, for example Gaussian process regression or regression splines, which directly model non-linear responses and would likely provide both more flexibility and better guarantees than your proposed approach.
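As a rough illustration of the regression-spline option, here is a minimal sketch that fits a cubic spline by ordinary least squares on a truncated power basis. The simulated data, the knot locations in `knots`, and the helper `spline_basis` are all assumptions made for the example, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumption): a smooth non-linear signal plus Gaussian noise.
x = np.sort(rng.uniform(0.0, 2.0 * np.pi, 200))
y = np.sin(x) + rng.normal(0.0, 0.2, x.size)

knots = np.array([1.5, 3.0, 4.5])  # hypothetical interior knots


def spline_basis(t, knots):
    """Cubic truncated power basis: 1, t, t^2, t^3, (t - k)_+^3 for each knot."""
    columns = [np.ones_like(t), t, t ** 2, t ** 3]
    columns += [np.maximum(t - k, 0.0) ** 3 for k in knots]
    return np.column_stack(columns)


# Least-squares fit of the spline coefficients.
coef, *_ = np.linalg.lstsq(spline_basis(x, knots), y, rcond=None)

# Evaluate the fitted curve on a grid and compare with the true signal.
grid = np.linspace(0.0, 2.0 * np.pi, 100)
fitted = spline_basis(grid, knots) @ coef
rmse = float(np.sqrt(np.mean((fitted - np.sin(grid)) ** 2)))
print(rmse)
```

The target is never binned here: the non-linearity is handled by the basis functions in $x$, and the whole curve is estimated in a single fit.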

If you really need the response to be piecewise linear, you could likely construct a model that would fit both the regressions AND the locations of the breakpoints at the same time (at least this should be feasible with Stan, not sure about the frequentist case).
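One simple non-Bayesian possibility, offered as an assumption rather than an established recommendation, is to profile the residual sum of squares over a grid of candidate breakpoints. The sketch below assumes a single breakpoint, a continuous piecewise linear mean, and simulated data; `fit_with_break` and the candidate grid are hypothetical names and choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (assumption): one change of slope at x = 4.
x = np.sort(rng.uniform(0.0, 10.0, 300))
true_break = 4.0
y = (1.0 + 0.5 * x + 1.5 * np.maximum(x - true_break, 0.0)
     + rng.normal(0.0, 0.3, x.size))


def fit_with_break(c):
    """Least-squares fit of a continuous piecewise linear model with a break at c."""
    basis = np.column_stack([np.ones_like(x), x, np.maximum(x - c, 0.0)])
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    rss = float(np.sum((y - basis @ coef) ** 2))
    return rss, coef


# Keep the candidate breakpoint with the smallest residual sum of squares.
candidates = np.linspace(1.0, 9.0, 161)
rss_values = [fit_with_break(c)[0] for c in candidates]
best_break = float(candidates[int(np.argmin(rss_values))])
_, best_coef = fit_with_break(best_break)
print(best_break, best_coef)
```

This only gives a point estimate; uncertainty about the breakpoint location would need extra work, which is part of the appeal of fitting the full model in something like Stan.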

Martin Modrák
  • 2
    Nice answer. In terms of literature, "Senn, S. (2003). Disappointing dichotomies. Pharmaceutical Statistics: The Journal of Applied Statistics in the Pharmaceutical Industry, 2(4), 239-240." might be a good one. – Björn Aug 12 '19 at 10:11