
Suppose I have some continuous data that looks like this (this is a mini example, not my real data):

X = [1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
     2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
     3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
     1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
     2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529]

Suppose I need to run the regression $Y_i = a + bX_i + cZ_i + e_i$, but $X_i$ has to be discretized into only 4 values. How should I do the discretization so as to minimize the impact on the estimated $\widehat{b}$ (for example, if $\widehat{b}$ is significant with the original $X_i$, the coefficient on the discretized $X_i$ should preferably remain significant), or, more loosely, so as to minimize information loss?
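
For concreteness, here is a minimal sketch of the setup. Only $X$ comes from the question; $Y$, $Z$, the quartile-based bins, and the statsmodels calls are illustrative assumptions, not anything the question specifies:

```python
import numpy as np
import statsmodels.api as sm

# the 20 X values from the question
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])

# Z and Y are simulated purely for illustration -- they are NOT from the question
rng = np.random.default_rng(0)
Z = rng.normal(size=X.size)
Y = 1.0 + 2.0 * X + 0.5 * Z + rng.normal(size=X.size)

# discretize X into 4 values: quartile bins, each represented by its within-bin mean
edges = np.quantile(X, [0.25, 0.5, 0.75])
bin_idx = np.digitize(X, edges)                               # bin index 0..3
bin_means = np.array([X[bin_idx == k].mean() for k in range(4)])
X_disc = bin_means[bin_idx]                                   # only 4 distinct values

# fit Y = a + b*X + c*Z + e with the original and the discretized X
fit_orig = sm.OLS(Y, sm.add_constant(np.column_stack([X, Z]))).fit()
fit_disc = sm.OLS(Y, sm.add_constant(np.column_stack([X_disc, Z]))).fit()

print("b-hat, original X:    %.3f (p = %.4f)" % (fit_orig.params[1], fit_orig.pvalues[1]))
print("b-hat, discretized X: %.3f (p = %.4f)" % (fit_disc.params[1], fit_disc.pvalues[1]))
```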

  • Consider looking into regression trees, which essentially do this for predictors as well. The literature there may have derived some optimality results for the splitting decisions. (A tree-based sketch follows these comments.) – Noah Oct 15 '20 at 08:29
  • [Don't bin your continuous data](https://stats.stackexchange.com/q/68834/1352). Feed them into your algorithm as-is; potentially transform them using (e.g.) restricted cubic splines (see, e.g., Frank Harrell's *Regression Modeling Strategies*) to capture any nonlinearity. In particular, don't go hunting for significance by "adjusting" bins. – Stephan Kolassa Oct 15 '20 at 08:38
  • @StephanKolassa Thanks! But the thing is, I need to make my result comparable to past results, which summarize $X_i$ into a discrete-valued index. What's the best way to discretize it then? – T34driver Oct 15 '20 at 17:26
  • @Noah Thanks! Can you recommend a few papers to me? – T34driver Oct 15 '20 at 17:26
  • Honestly, if people in the past shot themselves in the foot, then I would not try to benchmark my own foot-shooting against the state of the art, but try to do better. OK, I understand this is not realistic. My recommendation: don't search for the "best" way to do something bad. Instead, use the simplest possible binning for your comparison (e.g., bins of equal width or equal contents), and spend more brainpower and words in explaining why binning is a bad idea, and how to model your process better. – Stephan Kolassa Oct 15 '20 at 18:33
  • @StephanKolassa Thank you! This is really good advice. – T34driver Oct 15 '20 at 21:53
  • @StephanKolassa Looks like you have answered, can you do it formally? – kjetil b halvorsen Oct 16 '20 at 02:28
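
Following up on Noah's comment above, here is a minimal sketch of tree-based binning. scikit-learn and the simulated response are my assumptions; the idea is simply that a regression tree restricted to four leaves picks the cut points on $X$ that minimize within-leaf squared error:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# X from the question; Y is simulated only so the tree has a response to split on
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])
rng = np.random.default_rng(0)
Y = 1.0 + 2.0 * X + rng.normal(size=X.size)

# a tree with at most 4 leaves chooses data-driven cut points on X
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X.reshape(-1, 1), Y)
cuts = sorted(t for t in tree.tree_.threshold if t != -2)     # -2 marks leaf nodes
print("tree-chosen cut points:", cuts)

# the leaf means are the 4 discrete values that would replace X
X_disc = tree.predict(X.reshape(-1, 1))
print("discretized values:", np.unique(X_disc))
```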

1 Answer


Don't bin your continuous data. Feed them into your algorithm as-is; potentially transform them using (e.g.) restricted cubic splines (see, e.g., Frank Harrell's *Regression Modeling Strategies*) to capture any nonlinearity.
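
As a rough sketch of the spline route (my own construction, not prescribed by the answer): patsy's natural cubic spline basis `cr()` stands in for restricted cubic splines here, and $Y$ and $Z$ are again simulated, since only $X$ is given:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# X from the question; Y and Z are simulated purely for illustration
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])
rng = np.random.default_rng(0)
d = pd.DataFrame({"X": X, "Z": rng.normal(size=X.size)})
d["Y"] = 1.0 + 2.0 * d["X"] + 0.5 * d["Z"] + rng.normal(size=X.size)

# cr() is patsy's natural cubic regression spline basis;
# df=3 keeps the basis modest for only n = 20 observations
fit = smf.ols("Y ~ cr(X, df=3) + Z", data=d).fit()
print(fit.summary())
```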

In particular, don't go hunting for significance by "adjusting" bins. Your $p$ values will be biased low. This is no different than other ways of tweaking models to achieve low $p$ values.
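
A toy simulation (entirely my own construction) illustrates the point: when the true $b$ is zero and one keeps whichever of several candidate binnings gives the smallest $p$ value, the rejection rate climbs above the nominal 5%:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, n_sims, alpha = 100, 1000, 0.05

# several candidate quantile grids, each cutting X into 4 bins
candidate_cuts = [[0.25, 0.50, 0.75],
                  [0.10, 0.50, 0.90],
                  [0.20, 0.40, 0.60],
                  [0.40, 0.60, 0.80]]

def p_binned(X, Z, Y, probs):
    """p value on X after replacing it with 4 bin means."""
    idx = np.digitize(X, np.quantile(X, probs))
    means = np.array([X[idx == k].mean() for k in range(4)])
    fit = sm.OLS(Y, sm.add_constant(np.column_stack([means[idx], Z]))).fit()
    return fit.pvalues[1]

fixed, shopped = 0, 0
for _ in range(n_sims):
    X = rng.exponential(size=n)
    Z = rng.normal(size=n)
    Y = 1.0 + 0.5 * Z + rng.normal(size=n)            # true b = 0
    ps = [p_binned(X, Z, Y, c) for c in candidate_cuts]
    fixed += ps[0] < alpha                            # one pre-specified binning
    shopped += min(ps) < alpha                        # pick the "best" binning per dataset

print("rejection rate, fixed binning:", fixed / n_sims)     # close to 0.05
print("rejection rate, bin-shopping: ", shopped / n_sims)   # noticeably above 0.05
```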

You write:

> I need to make my result comparable to past results, which summarize $X_i$ into a discrete-valued index.

Honestly, if people in the past shot themselves in the foot, then I would not try to benchmark my own foot-shooting against the state of the art, but try to do better.

OK, I understand this is not realistic. My recommendation: don't search for the "best" way to do something bad. Instead, use the simplest possible binning for your comparison (e.g., bins of equal width or equal contents), and spend more brainpower and words in explaining why binning is a bad idea, and how to model your process and data better. Help the field grow out of bad practices.
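
For reference, here is a minimal sketch of the two simple schemes mentioned, applied to the $X$ from the question; the choice of four bins and the numpy-based cut points are illustrative assumptions:

```python
import numpy as np

# the 20 X values from the question
X = np.array([1.61247174986927, 2.65691016769718, 0.591138214153149, 0.726195765274149,
              2.88156040072165, 1.62455101313526, 6.43225443007122, 0.590263950142884,
              3.05416345831489, 2.82441594177780, 1.27093403949212, 0.414863903556840,
              1.34369968006468, 0.367816560010304, 1.19023283647451, 4.39095587146157,
              2.42508655542887, 0.295173291557651, 0.842110993459900, 4.94140793763529])

# equal-width bins: 4 intervals of equal length over the range of X
width_cuts = np.linspace(X.min(), X.max(), 5)[1:-1]      # 3 interior cut points
width_idx = np.digitize(X, width_cuts)

# equal-content bins: quartile cuts, so each bin holds roughly the same number of points
quant_cuts = np.quantile(X, [0.25, 0.5, 0.75])
quant_idx = np.digitize(X, quant_cuts)

print("equal-width cuts:  ", np.round(width_cuts, 3), "counts:", np.bincount(width_idx, minlength=4))
print("equal-content cuts:", np.round(quant_cuts, 3), "counts:", np.bincount(quant_idx, minlength=4))
```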

Stephan Kolassa