I have read that averaging and binning a continuous predictor variable is generally a bad idea, because it is always better to fit the continuous relationship with splines, polynomials and the like. Sure, I agree, especially for smaller, accurately measured data sets.

But what about big data and exponential distributions, where noise is more frequent and we don't necessarily want to skew the coefficients towards the center of the distribution, where we have most of the observations (although less interesting for our analysis)? Doesn't binning the predictors and the response variable reduce noise and improve our analysis for the full distribution?

Robert Kubrick
  • What does "noise is more frequent" mean? Isn't there noise in every observation? How do you get more frequent than that? What exactly is it you're trying to achieve? – Glen_b Sep 17 '14 at 01:31
  • See [this question](http://stats.stackexchange.com/questions/86536/can-the-use-of-dummy-variables-reduce-measurement-error/87313) – Scortchi - Reinstate Monica Nov 18 '14 at 13:47

1 Answer

How would binning a variable "reduce noise"? It seems to me that, whatever measurement error your variables already have, binning always adds more, because every value in a bin is replaced by the same bin label. I'm particularly skeptical of binning your outcome variable.

That said, I don't oppose binning in all circumstances. Binning predictors is sometimes an effective way to model non-linear relationships. Its primary advantage over other approaches is that the coefficients are easily interpretable. Sometimes that sort of ease of interpretation is a high priority.
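To make the interpretability point concrete, here is a minimal sketch on simulated data. The sine-shaped relationship, the five equal-width bins and all variable names are assumptions made for illustration, not anything taken from the question. With one indicator column per bin and no intercept, each least-squares coefficient is simply the mean response within that bin.

```python
import numpy as np

# Simulated data: a hypothetical non-linear relationship plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=2000)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Five equal-width bins on [0, 10]; bin_idx takes values 0..4.
edges = np.linspace(0, 10, 6)
bin_idx = np.digitize(x, edges[1:-1])

# Design matrix with one indicator (dummy) column per bin, no intercept.
D = np.eye(5)[bin_idx]
coef, *_ = np.linalg.lstsq(D, y, rcond=None)

# Each coefficient should match the average of y inside its bin.
bin_means = np.array([y[bin_idx == k].mean() for k in range(5)])
print(np.round(coef, 3))
print(np.round(bin_means, 3))
```

The two printed vectors should agree up to floating-point error, which is what makes the binned coefficients easy to read: each one is "the average outcome for observations in this range of the predictor".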

Regardless of whether you choose to bin or not, I urge you to be careful to avoid overfitting. You want a better fit, not an overfit.

Michael Bishop
  • Can you elaborate a bit more on "binning predictors is an effective way to model non-linear relationships"? – Robert Kubrick Sep 16 '14 at 17:31
  • You misquote me, I said "*sometimes* an effective..." ;) What I meant is that, compared with a model in which the continuous variable is entered linearly, a binned version of that variable with multiple bins may lead to a model with better predictive performance. Predictive performance with a binned variable may be similar to that of a model with quadratic and cubic terms or a spline. I never bin outcome variables, and I only bin predictors in a regression when the underlying relationship is non-linear and I want coefficients that are easier to interpret than those of a poly/spline would be. – Michael Bishop Sep 16 '14 at 22:05
  • Ok. I don't see why binning a predictor would improve the non-linear relationship. I do see that in my $R^2$ but I can't get the intuition... – Robert Kubrick Sep 16 '14 at 23:27
  • Well, if the true relationship is non-linear, then a binned variable with more than 2 bins will produce predictions that are non-linearly related to the predictor, resulting in a better fit... it's the same reason adding polynomial terms may improve the fit. Try this: fit two models, 1) with the continuous predictor and 2) with the binned predictor, and save the residuals from each. Then draw a boxplot of each model's residuals grouped by the binned values of the predictor. The residuals from the binned model will be more closely centered around zero. – Michael Bishop Sep 17 '14 at 02:00
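
The two-model comparison suggested in the last comment can be sketched as follows. This is only an illustration under stated assumptions: the data are simulated from a hypothetical quadratic relationship, the five equal-width bins are arbitrary, and per-bin residual means are printed as a numeric stand-in for the boxplot.

```python
import numpy as np

# Simulated data with a hypothetical quadratic truth (an assumption).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=2000)
y = (x - 5) ** 2 + rng.normal(scale=2.0, size=x.size)

# Model 1: intercept plus the continuous predictor entered linearly.
X_lin = np.column_stack([np.ones_like(x), x])
beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
resid_lin = y - X_lin @ beta_lin

# Model 2: the same predictor cut into five equal-width bins (dummy coded).
edges = np.linspace(0, 10, 6)
bin_idx = np.digitize(x, edges[1:-1])  # values 0..4
D = np.eye(5)[bin_idx]
beta_bin, *_ = np.linalg.lstsq(D, y, rcond=None)
resid_bin = y - D @ beta_bin

def mean_by_bin(resid):
    """Average residual within each bin of the predictor."""
    return np.array([resid[bin_idx == k].mean() for k in range(5)])

lin_means = mean_by_bin(resid_lin)
bin_means = mean_by_bin(resid_bin)
print("linear model, mean residual per bin:", np.round(lin_means, 2))
print("binned model, mean residual per bin:", np.round(bin_means, 2))
```

In this setup the binned model's residuals average to zero within every bin by construction, while the linear model's residuals drift away from zero in the outer and middle bins. That systematic drift is what the grouped boxplot would show. It says nothing about whether a spline or polynomial would do even better, and it is an in-sample comparison, so the overfitting caveat in the answer still applies.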