
Suppose I have some continuous data that looks like this (this is a mini example, not my real data):

X = [1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
     2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
     3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
     1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
     2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529]

Suppose I need to run the regression $Y_i = a + bX_i + cZ_i + e_i$, but $X_i$ has to be discretized into only 4 values. How should I do the discretization so as to minimize the impact on the estimated $\widehat{b}$ (for example, if $\widehat{b}$ is significant with the original $X_i$, the coefficient on the discretized $X_i$ should preferably remain significant), or, more loosely, so as to minimize information loss?
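
For concreteness, here is a minimal sketch of the setup. Only $X$ comes from the question; $Y$, $Z$, the quartile-based bins, and the statsmodels calls are illustrative assumptions, not anything the question specifies:

```python
import numpy as np
import statsmodels.api as sm

# the 20 X values from the question
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])

# Z and Y are simulated purely for illustration -- they are NOT from the question
rng = np.random.default_rng(0)
Z = rng.normal(size=X.size)
Y = 1.0 + 2.0 * X + 0.5 * Z + rng.normal(size=X.size)

# discretize X into 4 values: quartile bins, each represented by its within-bin mean
edges = np.quantile(X, [0.25, 0.5, 0.75])
bin_idx = np.digitize(X, edges)                               # bin index 0..3
bin_means = np.array([X[bin_idx == k].mean() for k in range(4)])
X_disc = bin_means[bin_idx]                                   # only 4 distinct values

# fit Y = a + b*X + c*Z + e with the original and the discretized X
fit_orig = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()
fit_disc = sm.OLS(Y, sm.add_constant(np.column_stack([X_disc, Z]))).fit()

print("b-hat, original X:    %.3f (p = %.4f)" % (fit_orig.params[1], fit_orig.pvalues[1]))
print("b-hat, discretized X: %.3f (p = %.4f)" % (fit_disc.params[1], fit_disc.pvalues[1]))
```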

  • Consider looking into regression trees, which essentially do this for predictors as well. The literature there may have derived some optimality results for the splitting decisions. (A tree-based sketch follows these comments.) – Noah Oct 15 '20 at 08:29
  • [Don't bin your continuous data](https://stats.stackexchange.com/q/68834/1352). Feed them into your algorithm as-is; potentially transform them using (e.g.) restricted cubic splines (see, e.g., Frank Harrell's *Regression Modeling Strategies*) to capture any nonlinearity. In particular, don't go hunting for significance by "adjusting" bins. – Stephan Kolassa Oct 15 '20 at 08:38
  • @StephanKolassa Thanks! But the thing is, I need to make my result comparable to past results, which summarize $X_i$ into a discrete-valued index. What's the best way to discretize it then? – T34driver Oct 15 '20 at 17:26
  • @Noah Thanks! Can you recommend a few papers to me? – T34driver Oct 15 '20 at 17:26
  • Honestly, if people in the past shot themselves in the foot, then I would not try to benchmark my own foot-shooting against the state of the art, but try to do better. OK, I understand this is not realistic. My recommendation: don't search for the "best" way to do something bad. Instead, use the simplest possible binning for your comparison (e.g., bins of equal width or equal contents), and spend more brainpower and words in explaining why binning is a bad idea, and how to model your process better. – Stephan Kolassa Oct 15 '20 at 18:33
  • @StephanKolassa Thank you! This is really good advice. – T34driver Oct 15 '20 at 21:53
  • @StephanKolassa Looks like you have answered, can you do it formally? – kjetil b halvorsen Oct 16 '20 at 02:28
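
Following up on Noah's comment above, here is a minimal sketch of tree-based binning. scikit-learn and the simulated response are my assumptions; the idea is simply that a regression tree restricted to four leaves picks the cut points on $X$ that minimize within-leaf squared error:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# X from the question; Y is simulated only so the tree has a response to split on
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])
rng = np.random.default_rng(0)
Y = 1.0 + 2.0 * X + rng.normal(size=X.size)

# a tree with at most 4 leaves chooses data-driven cut points on X
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X.reshape(-1, 1), Y)
cuts = sorted(t for t in tree.tree_.threshold if t != -2)     # -2 marks leaf nodes
print("tree-chosen cut points:", cuts)

# the leaf means are the 4 discrete values that would replace X
X_disc = tree.predict(X.reshape(-1, 1))
print("discretized values:", np.unique(X_disc))
```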

1 Answer


Don't bin your continuous data. Feed them into your algorithm as-is; potentially transform them using (e.g.) restricted cubic splines (see, e.g., Frank Harrell's *Regression Modeling Strategies*) to capture any nonlinearity.
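
As a rough sketch of the spline route (my own construction, not prescribed by the answer): patsy's natural cubic spline basis `cr()` stands in for restricted cubic splines here, and $Y$ and $Z$ are again simulated, since only $X$ is given:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# X from the question; Y and Z are simulated purely for illustration
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])
rng = np.random.default_rng(0)
d = pd.DataFrame({"X": X, "Z": rng.normal(size=X.size)})
d["Y"] = 1.0 + 2.0 * d["X"] + 0.5 * d["Z"] + rng.normal(size=X.size)

# cr() is patsy's natural cubic regression spline basis;
# df=3 keeps the basis modest for only n = 20 observations
fit = smf.ols("Y ~ cr(X, df=3) + Z", data=d).fit()
print(fit.summary())
```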

In particular, don't go hunting for significance by "adjusting" bins. Your $p$ values will be biased low. This is no different than other ways of tweaking models to achieve low $p$ values.
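
A toy simulation (entirely my own construction) illustrates the point: when the true $b$ is zero and one keeps whichever of several candidate binnings gives the smallest $p$ value, the rejection rate climbs above the nominal 5%:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_sims, alpha = 100, 1000, 0.05

# several candidate quantile grids, each cutting X into 4 bins
candidate_cuts = [[0.25, 0.50, 0.75],
                  [0.10, 0.50, 0.90],
                  [0.20, 0.40, 0.60],
                  [0.40, 0.60, 0.80]]

def p_binned(X, Z, Y, probs):
    """p value on X after replacing it with 4 bin means."""
    idx = np.digitize(X, np.quantile(X, probs))
    means = np.array([X[idx == k].mean() for k in range(4)])
    fit = sm.OLS(Y, sm.add_constant(np.column_stack([means[idx], Z]))).fit()
    return fit.pvalues[1]

fixed, shopped = 0, 0
for _ in range(n_sims):
    X = rng.exponential(size=n)
    Z = rng.normal(size=n)
    Y = 1.0 + 0.5 * Z + rng.normal(size=n)            # true b = 0
    ps = [p_binned(X, Z, Y, c) for c in candidate_cuts]
    fixed += ps[0] < alpha                            # one pre-specified binning
    shopped += min(ps) < alpha                        # pick the "best" binning per dataset

print("rejection rate, fixed binning:", fixed / n_sims)     # close to 0.05
print("rejection rate, bin-shopping: ", shopped / n_sims)   # noticeably above 0.05
```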

You write:

> I need to make my result comparable to past results, which summarize $X_i$ into a discrete-valued index.

Honestly, if people in the past shot themselves in the foot, then I would not try to benchmark my own foot-shooting against the state of the art, but try to do better.

OK, I understand this is not realistic. My recommendation: don't search for the "best" way to do something bad. Instead, use the simplest possible binning for your comparison (e.g., bins of equal width or equal contents), and spend more brainpower and words in explaining why binning is a bad idea, and how to model your process and data better. Help the field grow out of bad practices.
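
For reference, here is a minimal sketch of the two simple schemes mentioned, applied to the $X$ from the question; the choice of four bins and the numpy-based cut points are illustrative assumptions:

```python
import numpy as np

# the 20 X values from the question
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])

# equal-width bins: 4 intervals of equal length over the range of X
width_cuts = np.linspace(X.min(), X.max(), 5)[1:-1]      # 3 interior cut points
width_idx = np.digitize(X, width_cuts)

# equal-content bins: quartile cuts, so each bin holds roughly the same number of points
quant_cuts = np.quantile(X, [0.25, 0.5, 0.75])
quant_idx = np.digitize(X, quant_cuts)

print("equal-width cuts:  ", np.round(width_cuts, 3), "counts:", np.bincount(width_idx, minlength=4))
print("equal-content cuts:", np.round(quant_cuts, 3), "counts:", np.bincount(quant_idx, minlength=4))
```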

Stephan Kolassa