Optimal Binning with respect to a given response variable

Question

I'm looking for optimal binning method (discretization) of a continuous variable with respect to a given response (target) binary variable and with maximum number of intervals as a parameter.

example: I have a set of observations of people with "height" (numeral continuous) and "has_back_pains" (binary) variables. I want do discretize height into 3 intervals (groups) at most with different proportion of people with back aches, so that the algorithm maximizes the difference between the groups (with given restrictions for instance, that each interval has at least x observations).

The obvious solution to this problem would be to use decision trees(a simple one-variable model) , but I can't find any function in R that would have "maximal number of branches" as a parameter - all of them divide the variable into 2 gropus (<=x and >x). SAS miner has a "maximum branch" parameter but I'm looking for a non commercial solution.

some of my variables have just a few unique values (and could be treated as discrete variables) but I want to discretize them as well into a smaller number of intervals.

The closest solution to my problem is implemented in the smbinning package in R (which relies on ctree function from party package) but it has two drawbacks: it's impossible to set the number of intervals (however, you can find a way around it by changing the p parameter) and it doesn't work when data vector has less than 10 unique values. Anyway, you can see the example output here(Cutpoint and Odds columns are crucial):

Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec BadRate   Odds  LnOdds     WoE     IV
1   <= 272   9081     169   8912      9081        169      8912 0.1874  0.9814 0.0190 -3.9653 -0.6527 0.0596
2   <= 311   8541     246   8295     17622        415     17207 0.1762  0.9712 0.0297 -3.5181 -0.2055 0.0068
3   <= 335   2986     163   2823     20608        578     20030 0.0616  0.9454 0.0577 -2.8518  0.4608 0.0163
4  Missing  27852    1125  26727     48460       1703     46757 0.5747  0.9596 0.0421 -3.1679  0.1447 0.0129
5    Total  48460    1703  46757        NA         NA        NA 1.0000  0.9649 0.0364 -3.3126  0.0000 0.0956

Oh, I'm fully aware that binning results in information loss and that there are better methods, but I'm going to use it for data visualization and treat those variables as a factor.

SPSS has Optimal Binning command. Google `SPSS Algorithms Optimal Binning`. — ttnphns, Apr 29 '15 at 00:51
Have you seen this post http://stackoverflow.com/questions/7018954/binning-continuous-variables-by-iv-value-in-r It mentions Information Value usage but not clear what it means by IV=1, or doesn't explain how to get that — adam, May 22 '15 at 13:59

score 8 · Answer 1 · edited Dec 23 '15 at 00:53

While reading this book here (Nagarajan, 2103 [1]), I came across this valuable information that I am shamelessly citing here:

Using prior knowledge on the data. The boundaries of the intervals are defined, for each variable, to correspond to significantly different real-world scenarios, such as the concentration of a particular pollutant (absent, dangerous, lethal) or age classes (child, adult, elderly).
Using heuristics before learning the structure of the network. Some examples are: Sturges, Freedman-Diaconis, or Scott rules (Venables and Ripley, 2002).
Choosing the number of intervals and their boundaries to balance accuracy and information loss (Kohavi and Sahami, 1996), again one variable at a time and before the network structure has been learned. A similar approach considering pairs of variables is presented in Hartemink (2001).
Performing learning and discretization iteratively until no improvement is made (Friedman and Goldszmidt, 1996).

These strategies represent different trade-offs between the accuracy of the discrete representation of the original data and the computational efficiency of the transformation.

This information is provided, in case you want to justify the binning method you wish to use and not just use a package directly.

[1]: Nagarajan R. (2013),
Bayesian Networks in R, with Applications in Systems Biology
Springer

score 4 · Answer 2 · edited May 23 '17 at 12:39

Try Information package for R. https://cran.r-project.org/web/packages/Information/Information.pdf https://cran.r-project.org/web/packages/Information/vignettes/Information-vignette.html

Information package has functionality for calculating WoE and IV (number of bins is a flexible parameter, default is 10) and is a handy instrument for data exploration and consequently for binning. The output doesn't contain Odds, though; and it is not possible to specify zero as a separate bin (for my tasks zero is often a valid bin of its own right); and it would be nice to get output from Information package that would be like that of smbinning. However, that being said about nice-to-have but still not available features of the Information package, other R packages for WoE and IV (woe, klaR) didn't make impression of as useful instruments as the Information package, in fact I failed to run them after 2-3 attempts. For the task of dscretisation/binning, Information and smbinning packages can work together nicely, with some manually editing and reviewing the outputs in a spreadsheet editor, and their combined output is most likely to be sufficient for the purpose.

For actual binning I used data.table instead of cut() function. See link to my post below, it contains generic code in the very bottom of the initial question: https://stackoverflow.com/questions/34939845/binning-variables-in-a-dataframe-with-input-bin-data-from-another-dataframe

Hope it helps.

@kjetil, kjetil b halvorsen, you are right. Information package has functionality for calculating WoE and IV (number of bins is a flexible parameter, default is 10) and is a handy instrument for data exploration and consequent for binning. The output doesn't contain Odds, though. And it is not possible to specify zero as a separate bin (for my tasks zero is often a valid bin of its own right). Other R packages for WoE and IV (woe, klaR) didn't make impression of as useful instruments as the Information package. So Information and smbinning package can work together nicely as a combination. — Aktan, Sep 05 '16 at 03:21

Optimal Binning with respect to a given response variable

2 Answers2

Linked

Related