3

SPSS has an optimal binning function that helps categorize continuous predictors into meaningful intervals when a binary response variable exists. I was looking for an equivalent function in R, but I have not found one. I am not sure whether bins derived from CART or CTREE would be equivalent.

Giorgio Spedicato
  • 8
    In practice very, very few people know both SPSS and R in any depth. I think you would need to be much more precise about what this "optimal binning" is to get an answer. That aside, binning a continuous predictor is widely deprecated as very poor statistical practice, in my view fairly. http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous is a good introduction. In addition, "optimal binning" (not your name, presumably) is a loaded term! – Nick Cox Oct 14 '14 at 10:13
  • 5
    See also e.g. http://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable in this forum. – Nick Cox Oct 14 '14 at 10:35
  • I agree that restricted cubic splines or nonparametric smoothers account for nonlinearity better. Nevertheless, the algorithm that this analysis will derive cannot make use of such smoothers. – Giorgio Spedicato Oct 14 '14 at 15:05
  • There is a `cut` function, and in the documentation for `hist` (`?hist`) you can find information about algorithms that choose an "optimal" number of bins for a histogram. See also http://stats.stackexchange.com/questions/163778/how-do-you-find-a-cutting-point-strong-slope-within-one-dimensional-data/163787#163787 – Tim Oct 17 '15 at 07:57

2 Answers

3

Since early 2015 there has been a package called "smbinning", which stands for Optimal Binning for Scoring Modeling. It gives you the optimal cut points for a numeric variable; more precisely, it chooses cut points that optimize the information value. It can also handle categorical variables and missing values.

For example:

smbinning(df, y, x, p = 0.05)
  • df: data frame
  • y: binary dependent variable
  • x: numeric independent variable
  • p: percentage of records per bin

It returns a list that contains the information value, the information value table, and other items. You can find details in the documentation on CRAN or at http://www.scoringmodeling.com/
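To make "information value" concrete, here is a minimal base R sketch of how it can be computed once the cut points are given. This is not smbinning's code: the helper name `information_value`, the toy data, and the hand-picked cut points are all hypothetical, and it assumes y is coded 0/1 and that every bin contains both classes (otherwise the logarithm is infinite).

```r
# Hypothetical helper (not part of smbinning): information value of a
# numeric predictor x against a 0/1 response y for given cut points.
information_value <- function(x, y, cuts) {
  # Right-closed bins: (-Inf, c1], (c1, c2], ..., (ck, Inf)
  bins <- cut(x, breaks = c(-Inf, sort(unique(cuts)), Inf))
  tab  <- table(bins, factor(y, levels = c(0, 1)))

  # Share of all 0s and of all 1s that fall into each bin
  p0 <- as.vector(tab[, "0"]) / sum(tab[, "0"])
  p1 <- as.vector(tab[, "1"]) / sum(tab[, "1"])

  # Weight of evidence per bin, and the information value as their weighted sum
  woe <- log(p1 / p0)
  iv  <- sum((p1 - p0) * woe)

  list(
    table = data.frame(bin = rownames(tab),
                       n0  = as.vector(tab[, "0"]),
                       n1  = as.vector(tab[, "1"]),
                       woe = woe),
    iv = iv
  )
}

# Toy data, made up for illustration only
x <- 1:10
y <- c(0, 0, 0, 1, 0, 1, 0, 1, 1, 1)

res_a <- information_value(x, y, cuts = 5)  # candidate cut at 5
res_b <- information_value(x, y, cuts = 4)  # candidate cut at 4

print(res_a$table)
print(c(cut_at_5 = res_a$iv, cut_at_4 = res_b$iv))
```

A binning routine that "optimizes the information value" is, roughly, searching over candidate cut points for the set with the highest value of `iv`; here the two candidates are simply compared by hand.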

Anthony Lei
  • 3
    To be honest not the biggest fan of the smbinning package. I haven't coded anything better but the coding in the package feels "amateurish", and it fails in many of the test cases I tried. I don't recommend smbinning at v0.2. – xiaodai Nov 24 '15 at 02:51
2

You can try the discretization package and its cutPoints function: http://cran.r-project.org/web/packages/discretization/discretization.pdf.
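A minimal sketch of how that might look, assuming the discretization package is installed and that `cutPoints(x, y)` takes a numeric vector and a class vector, as in the linked manual. The binary response built from `iris` is made up purely for illustration, and the code falls back to a single bin if the package is not available.

```r
# Illustrative data: a numeric predictor and a made-up binary response
x <- iris$Sepal.Length
y <- as.integer(iris$Species == "virginica")

# Supervised cut points from the discretization package, if it is installed
cuts <- NULL
if (requireNamespace("discretization", quietly = TRUE)) {
  cuts <- discretization::cutPoints(x, y)
}

# Turn the cut points into a factor of bins with base R's cut()
bins <- cut(x, breaks = c(-Inf, sort(unique(cuts)), Inf))

# Cross-tabulate bins against the response
print(cuts)
print(table(bins, y))
```

The resulting `bins` factor can then be used as a categorical predictor in place of x.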