I have read many times that binning continuous variables is a very bad idea.

For instance, let's take something like heart rate and let's define the following 2 bins:

(125 - 135), (136 - 145)

Let's say that (136 - 145) corresponds to a hard effort.

If your exercise session keeps your heart rate at a steady 135, binning will report that you spent no time exercising hard, even though you were only 1 beat per minute away from that bin the whole time.

Obviously, this is an exaggerated example to illustrate the point.

I was thinking of weighting each second spent at a given heart rate by its distance from the centers of the two bins it falls between.

For instance, in the example above the bin centers are 130 and 140.5, so 140.5 would be 100% in the second bin, 135.25 (midway between the two centers) would be 50% in the first bin and 50% in the second bin, and 130 would be 100% in the first bin.
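The weighting described above amounts to linear interpolation between bin centers. Here is a minimal sketch of it, assuming one heart-rate sample per second and taking the centers 130 and 140.5 as the midpoints of the two bins in the question. The function names `soft_bin_weights` and `time_in_bins` are made up for illustration. With these centers the even split falls at 135.25, and values outside the outermost centers are assigned entirely to the nearest bin, which is one possible convention among several.

```python
def soft_bin_weights(hr, centers):
    """Split one observation between the two bins whose centers bracket it.

    Weights fall off linearly with distance from each center and sum to 1.
    Values beyond the outermost centers go entirely to the nearest bin
    (an assumption; other edge conventions are possible).
    `centers` must be sorted in increasing order.
    """
    weights = [0.0] * len(centers)
    if hr <= centers[0]:
        weights[0] = 1.0
        return weights
    if hr >= centers[-1]:
        weights[-1] = 1.0
        return weights
    for i in range(len(centers) - 1):
        lo, hi = centers[i], centers[i + 1]
        if lo <= hr <= hi:
            frac = (hr - lo) / (hi - lo)  # 0 at the lower center, 1 at the upper
            weights[i] = 1.0 - frac
            weights[i + 1] = frac
            break
    return weights


def time_in_bins(heart_rates, centers, seconds_per_sample=1.0):
    """Accumulate weighted seconds per bin over a whole session."""
    totals = [0.0] * len(centers)
    for hr in heart_rates:
        for i, w in enumerate(soft_bin_weights(hr, centers)):
            totals[i] += w * seconds_per_sample
    return totals


centers = [130.0, 140.5]   # midpoints of (125 - 135) and (136 - 145)
session = [135.0] * 600    # ten minutes at a steady 135 bpm
print(time_in_bins(session, centers))
```

For the steady 135 bpm session, a little under half of the ten minutes is credited to the harder bin instead of none of it, and the two totals still add up to the session length.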

Does that sound like nonsense or a reasonable solution?

What would be a better way?

L_R_T
  • 2
    Why not just leave it as continuous? – Peter Flom May 15 '14 at 13:39
  • Classification or regression? What is your response and do you have any other covariates? – M. Berk May 15 '14 at 13:39
  • @PeterFlom It is assumed that exercise in specific bins produces specific physiological adaptations. In order to be able to quantify how an exercise session contributed to the various physiological adaptations, binning seems attractive, but... – L_R_T May 15 '14 at 13:42
  • 1
    Yes, OK, but the assumptions about the bins are designed to allow a person who is exercising to make a decision about how hard to exercise; they are coarser than the raw data. It seems unlikely that the physiology of a heart rate of 135 is the same as 125 but different from 136. So, you should account for the binning at a later stage in the analysis - when you are making recommendations. – Peter Flom May 15 '14 at 14:17
  • 1
    @Laurent, you get quantification with regression also. Think of the simplest case, a line: you get not only a $y$-intercept *quantity*, but also the quantity of how much $y$ changes for a 1-unit increase in $x$. Of course the exciting and fun stuff comes with multiple regression and interactions, and nonlinear regression with possibly complex relationships between $y$ and a given $x$. But these are all *quantified*. – Alexis May 15 '14 at 14:27
  • Your proposal is not nonsense. Mathematically, it is a linear spline. Splines are well-studied in regression and often are a good choice to handle non-linear relationships among dependent and independent variables, as discussed (in detail) in [Frank Harrell's book](http://www.amazon.com/Regression-Modeling-Strategies-Applications-Statistics/dp/0387952322) *inter alia.* Splines are more flexible tools than your approach, though, indicating that a studied application of them cannot be any worse and could be somewhat (or perhaps a lot) better. – whuber May 15 '14 at 16:20
  • @PeterFlom Yes, I agree that any bin boundary will be arbitrary and hard to justify, but there is an expectation that certain zones have certain meaning. I must respect that convention. – L_R_T May 15 '14 at 17:56
  • @Alexis What would be x, what would be y? How would I glance at a regression line/curve and figure out how much time I spent in any given bin/zone? – L_R_T May 15 '14 at 17:58
  • 2
    If the data are continuous, what happens to someone whose heart rate is 135.6? – Glen_b May 15 '14 at 19:36
  • 3
    My point was **not binning is quantitative**. Binning suffers from loss of power and the potential for quite serious aggregation bias; this is true both theoretically and in my experience with actual data. – Alexis May 15 '14 at 19:42
  • @Alexis I got you, but this is as much a data visualization problem as it is a data analysis problem. – L_R_T May 16 '14 at 00:49
  • 1
    Therefore, why not use the most assumption-free data visualization techniques? I.e. nonparametric smoothing regression that lets the relationship between $y$ (outcome) and $x$ (predictor) speak for itself, without imposing arbitrary (and potentially quite biasing) aggregation structure onto it? – Alexis May 16 '14 at 02:27
  • @Alexis Thank you for your help, I'll look into that. – L_R_T May 16 '14 at 13:52
  • 3
    Possible duplicate of [What is the justification for unsupervised discretization of continuous variables?](https://stats.stackexchange.com/questions/104402/what-is-the-justification-for-unsupervised-discretization-of-continuous-variable) – kjetil b halvorsen Dec 19 '18 at 12:10
  • Other dup: https://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable?noredirect=1&lq=1 – kjetil b halvorsen Dec 19 '18 at 12:30
  • This https://stats.stackexchange.com/a/232088/99274 shows the effect of mean x-value bin assignments. – Carl Dec 19 '18 at 22:31
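One way to act on the suggestion in the comments to keep heart rate continuous and apply the zones only when reporting is to summarize a session as "seconds spent at or above each heart rate". The sketch below assumes one sample per second; `seconds_at_or_above` and `exceedance_curve` are hypothetical helper names, not functions from any library. It is not the smoothing regression mentioned above (that needs an outcome variable, which the question does not specify), only a bin-free summary of the same data.

```python
def seconds_at_or_above(heart_rates, threshold, seconds_per_sample=1.0):
    """Total time spent at or above a given heart rate."""
    return seconds_per_sample * sum(1 for hr in heart_rates if hr >= threshold)


def exceedance_curve(heart_rates, thresholds, seconds_per_sample=1.0):
    """Time at or above each threshold: a continuous summary with no bins."""
    return [
        (t, seconds_at_or_above(heart_rates, t, seconds_per_sample))
        for t in thresholds
    ]


session = [135.0] * 600  # ten minutes at a steady 135 bpm
curve = exceedance_curve(session, range(125, 146))
for threshold, seconds in curve:
    print(threshold, seconds)
```

Plotted against the threshold, this curve shows the whole session sitting just under the 136 boundary, so a reader can see how close the effort was to the harder zone before any zone labels are applied.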

0 Answers