A good alternative to data binning?

Question

I have read many times that binning continuous variables is a very bad idea.

For instance, let's take something like heart rate and let's define the following 2 bins:

(125 - 135), (136 - 145)

Let's say that (136 - 145) corresponds to a hard effort.

If your exercise session keeps your heart rate at 135 consistently, binning will report that you spent no time exercising hard, even though you were 1 beat per minute away from that bin the whole time.

Obviously, this is an exaggerated example to illustrate the point.

I was thinking of weighting each second spent at a given heart rate by its distance from the centers of the two bins it falls between.

For instance, in the example above, the bin centers are 130 and 140.5, so 140.5 would be 100% in the second bin, 135.25 (midway between the two centers) would be 50% in the first bin and 50% in the second bin, and 130 would be 100% in the first bin.
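Here is a minimal sketch of that weighting. It assumes the bin centers are 130 and 140.5 (the midpoints of the two bins above), one heart rate sample per second, and that any heart rate outside the outermost centers counts fully toward the nearest bin. The function names `soft_bin_weights` and `time_in_bins` are hypothetical, chosen only for illustration. With these centers, the exact 50/50 point is 135.25.

```python
def soft_bin_weights(hr, centers):
    """Split one heart rate sample between the two nearest bin centers.

    `centers` must be sorted ascending. Returns one weight per bin;
    the weights always sum to 1.
    """
    weights = [0.0] * len(centers)
    # Assumption: samples beyond the outermost centers belong fully
    # to the nearest bin.
    if hr <= centers[0]:
        weights[0] = 1.0
        return weights
    if hr >= centers[-1]:
        weights[-1] = 1.0
        return weights
    # Otherwise interpolate linearly between the two surrounding centers.
    for i in range(len(centers) - 1):
        lo, hi = centers[i], centers[i + 1]
        if lo <= hr <= hi:
            frac = (hr - lo) / (hi - lo)
            weights[i] = 1.0 - frac
            weights[i + 1] = frac
            break
    return weights


def time_in_bins(samples, centers):
    """Sum the per-sample weights; with 1 Hz samples the totals are seconds."""
    totals = [0.0] * len(centers)
    for hr in samples:
        for i, w in enumerate(soft_bin_weights(hr, centers)):
            totals[i] += w
    return totals


CENTERS = [130.0, 140.5]      # midpoints of (125 - 135) and (136 - 145)
session = [135.0] * 600       # ten minutes at a steady 135 bpm
totals = time_in_bins(session, CENTERS)
```

With hard bins, the steady 135 bpm session above puts all 600 seconds in the first bin. With this weighting, a bit under half of the time (5/10.5 of each second) is credited to the second bin.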

Does that sound like nonsense or a reasonable solution?

What would be a better way?

Classification or regression? What is your response and do you have any other covariates? — M. Berk, May 15 '14 at 13:39
@PeterFlom It is assumed that exercise in specific bins produces specific physiological adaptations. In order to be able to quantify how an exercise session contributed to the various physiological adaptations, binning seems attractive, but... — L_R_T, May 15 '14 at 13:42
Yes, OK, but the assumptions about the bins are designed to allow a person who is exercising to make a decision about how hard to exercise; they are coarser than the raw data. It seems unlikely that the physiology of a heart rate of 135 is the same as 125 but different from 136. So, you should account for the binning at a later stage in the analysis - when you are making recommendations. — Peter Flom, May 15 '14 at 14:17
@Laurent, you get quantification with regression also. Think of the simplest case, a line: you get both a $y$-intercept *quantity* and the quantity of how much $y$ changes for a 1-unit increase in $x$. Of course the exciting and fun stuff comes with multiple regression and interactions, and nonlinear regression with possibly complex relationships between $y$ and a given $x$. But these are all *quantified*. — Alexis, May 15 '14 at 14:27
Your proposal is not nonsense. Mathematically, it is a linear spline. Splines are well-studied in regression and often are a good choice to handle non-linear relationships among dependent and independent variables, as discussed (in detail) in [Frank Harrell's book](http://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/0387952322) *inter alia.* Splines are more flexible tools than your approach, though, indicating that a studied application of them cannot be any worse and could be somewhat (or perhaps a lot) better. — whuber, May 15 '14 at 16:20
@PeterFlom Yes, I agree that any bin boundary will be arbitrary and hard to justify, but there is an expectation that certain zones have certain meaning. I must respect that convention. — L_R_T, May 15 '14 at 17:56
@Alexis What would be x, what would be y? How would I glance at a regression line/curve and figure out how much time I spent in any given bin/zone? — L_R_T, May 15 '14 at 17:58
If the data are continuous, what happens to someone whose heart rate is 135.6? — Glen_b, May 15 '14 at 19:36
My point was **not binning is quantitative**. Binning suffers from loss of power and the potential for quite serious aggregation bias; this is true both theoretically and in my experience with actual data. — Alexis, May 15 '14 at 19:42
@Alexis I got you, but this is as much a data visualization problem as it is a data analysis problem. — L_R_T, May 16 '14 at 00:49
Therefore, why not use the most assumption-free data visualization techniques? I.e., nonparametric smoothing regression that lets the relationship between $y$ (outcome) and $x$ (predictor) speak for itself, without imposing an arbitrary (and potentially quite biasing) aggregation structure onto it? — Alexis, May 16 '14 at 02:27
Possible duplicate of [What is the justification for unsupervised discretization of continuous variables?](https://stats.stackexchange.com/questions/104402/what-is-the-justification-for-unsupervised-discretization-of-continuous-variable) — kjetil b halvorsen, Dec 19 '18 at 12:10
Other dup: https://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable?noredirect=1&lq=1 — kjetil b halvorsen, Dec 19 '18 at 12:30
This https://stats.stackexchange.com/a/232088/99274 shows the effect of mean x-value bin assignments. — Carl, Dec 19 '18 at 22:31
