As discussed here, in most cases binning a continuous variable into a discrete one is a bad idea. But when is it a good idea?

My attempt to answer:

  • Does binning increase the variance of the model, as suggested in the post "Should we bin continuous variables?"

  • If a variable has extreme outliers or a lot of noise (say, income: some people make billions, and some data points are incorrectly collected and recorded as 0), might binning be better?
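
To make the income scenario in the second bullet concrete, here is a minimal sketch of what quantile binning does to such a variable. The income figures and the helper `quantile_bin` are made up for illustration, and the example only shows the mechanics; it does not settle whether binning is the better choice.

```python
import statistics

def quantile_bin(values, n_bins):
    """Map each value to the index (0..n_bins-1) of its quantile bin.

    Hypothetical helper for illustration: cut points come from
    statistics.quantiles, and a value's bin is the number of cut
    points it exceeds.
    """
    cuts = statistics.quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [sum(v > c for c in cuts) for v in values]

# Made-up incomes: two values wrongly recorded as 0 and one billionaire.
incomes = [0, 0, 28_000, 35_000, 41_000, 52_000,
           60_000, 75_000, 90_000, 2_000_000_000]

bins = quantile_bin(incomes, 4)

for income, b in zip(incomes, bins):
    print(f"{income:>13,} -> bin {b}")
```

The billionaire lands in the same top bin as the next-highest income, so the extreme value can no longer dominate a fit. The wrongly recorded zeros, however, simply land in the bottom bin, where they are indistinguishable from genuinely low incomes, so binning hides that kind of error rather than correcting it.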

Haitao Du
  • *Non-linear* relationships? :) – Richard Hardy Nov 14 '16 at 13:50
  • See [What is the benefit of breaking up a continuous predictor variable?](http://stats.stackexchange.com/q/68834/17230). A brief riposte to your 2nd point is here: http://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable/68839#comment228826_117994 – Scortchi - Reinstate Monica Nov 14 '16 at 14:17
  • Didn't you ask this before: http://stats.stackexchange.com/questions/230750/when-should-we-discretize-binning-continuous-independent-variables-features-an/230783#230783 ? – Matthew Drury Nov 14 '16 at 15:00
  • @MatthewDrury wow, it is surprising I do not remember my own questions. Worked overtime too much recently. – Haitao Du Nov 14 '16 at 15:01
  • Heh. Occasionally I find an answer and think "What, that's so wrong!" and then I scroll down, and it's me from a year ago. – Matthew Drury Nov 14 '16 at 15:02
  • @MatthewDrury But this question is good in that Scortchi provided some links I had not found. Also, it reminds me to study your answer and code carefully and make notes! – Haitao Du Nov 14 '16 at 15:05
  • Eep. That code is, not so good. – Matthew Drury Nov 14 '16 at 16:46
  • Binning, taken to an extreme, puts all data into a single category--which obviously has no variance. Therefore binning *decreases* variance; it does not increase it. However, if you are referring to the variance of the *error* in a regression model, then binning a regressor (assuming the bins continue to be treated numerically rather than as factors) typically will *increase* the error variance because it destroys the (presumed) linear relationship between regressor and response, introducing "noise." – whuber Nov 14 '16 at 17:16
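
The distinction in the last comment can be checked with a small simulation. This is a minimal sketch under stated assumptions, not anyone's posted code: the data are simulated with a truly linear relationship (slope 2, unit Gaussian noise), the regressor is binned into equal-width bins that are then treated numerically via their midpoints, and the helper names (`fit_line`, `residual_variance`, `bin_to_midpoints`) are invented for the example.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def residual_variance(x, y):
    """Mean squared residual of the least-squares line of y on x."""
    a, b = fit_line(x, y)
    return statistics.fmean([(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)])

def bin_to_midpoints(x, n_bins, lo, hi):
    """Replace each value by the midpoint of its equal-width bin on [lo, hi]."""
    width = (hi - lo) / n_bins
    out = []
    for xi in x:
        k = min(int((xi - lo) / width), n_bins - 1)  # clamp the right edge
        out.append(lo + (k + 0.5) * width)
    return out

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000
x = [rng.uniform(0, 10) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]  # assumed linear truth

x_binned = bin_to_midpoints(x, 3, 0.0, 10.0)  # three bins
x_single = bin_to_midpoints(x, 1, 0.0, 10.0)  # the extreme case: one bin

var_x = statistics.pvariance(x)
var_x_binned = statistics.pvariance(x_binned)
var_x_single = statistics.pvariance(x_single)
resid_raw = residual_variance(x, y)
resid_binned = residual_variance(x_binned, y)

print("variance of x:        raw", var_x, "| 3 bins", var_x_binned,
      "| 1 bin", var_x_single)
print("residual variance of y: raw", resid_raw, "| 3 bins", resid_binned)
```

In this setup the variance of the regressor shrinks as the bins get coarser and reaches zero with a single bin, while the residual variance of the regression grows once the regressor is binned. A single-bin regressor is not regressed here because its slope is undefined.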
