Optimal multivariate binning where the cut-points must be the same for all observations

Question

I have a large data set with many discrete and continuous variables. All the variables are present in every observation. I want to explain (the log of) one continuous variable using all the other variables I have selected. For this purpose I wish to divide the independent continuous variables into bins so as to maximize the between-bin variation in the dependent variable relative to the within-bin variation, subject to the constraint that the break-points in the binned variables must be the same for all observations. Within- and between-bin variation should be given a multivariate interpretation, i.e. single bins are formed as the cross product of all the binning cuts. I'd also like to ensure that every bin includes some minimal number of observations, but I am guessing that I will have to do this "by hand," e.g. by setting a maximum number of bins for each variable individually.
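As an illustration of the criterion described above, here is a minimal sketch in Python (standard library only) that scores one candidate set of cut-points. It is not an optimizer, and the names `cell_of`, `between_within`, and the toy data are assumptions made up for this example. Each observation is assigned to a cell of the cross product of the per-variable bins, and the function returns the between-cell and within-cell sums of squares of the dependent variable together with the smallest non-empty cell count.

```python
from bisect import bisect_right
from collections import defaultdict

def cell_of(row, cuts):
    """Map one observation to its cell: a tuple of bin indices, one per variable.

    cuts[j] is the sorted list of break-points for variable j; the same
    break-points are applied to every observation.
    """
    return tuple(bisect_right(c, x) for x, c in zip(row, cuts))

def between_within(X, y, cuts):
    """Return (between-cell SS, within-cell SS, smallest non-empty cell count)."""
    cells = defaultdict(list)
    for row, yi in zip(X, y):
        cells[cell_of(row, cuts)].append(yi)
    grand = sum(y) / len(y)
    between = within = 0.0
    for vals in cells.values():
        m = sum(vals) / len(vals)
        between += len(vals) * (m - grand) ** 2
        within += sum((v - m) ** 2 for v in vals)
    # Cells with no observations never appear in `cells`, so they are not counted here.
    return between, within, min(len(v) for v in cells.values())

# Toy data (hypothetical): two continuous predictors, y already on the log scale.
X = [(1.0, 10.0), (2.0, 20.0), (8.0, 10.0), (9.0, 20.0)]
y = [1.0, 1.2, 3.0, 3.2]
cuts = [[5.0], []]  # one break-point for the first variable, none for the second
between, within, smallest = between_within(X, y, cuts)
```

On this toy data the between-cell sum of squares is 4.0 and the within-cell sum of squares is 0.04 (up to floating-point rounding), with two observations in each cell. A search over cut-points would call such a function repeatedly and reject candidates whose smallest cell falls below the required minimum.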

Can anyone recommend an algorithm or package for this purpose? I expect to do the work in R.

I should be clear that I am not requiring that bin widths be equal.

I don't know if this makes any difference, but my purpose in doing the binning is to set up pseudo-strata for a complex survey where the stratification is not published for confidentiality reasons. I have replicate weights for recent years, but I want to come up with something I can use in every year and see if the variance estimates maintain a constant ratio to the replicate variances.

score 1 · Answer 1 · answered Jan 26 '19 at 23:36

Binning continuous predictors in this way is probably not a good idea. Cut-points determined on a particular data sample are likely not to work as well on later data samples. You will typically get best performance by first modeling the continuous predictors as continuous, using spline fits or other transformations if you are doing linear regression. This page and pages linked from it go into detail, with examples. (As a comment on that page notes, if you are using tree-based approaches rather than linear regression then you are already binning the continuous variables but in a more reliable way.)

If you have to define some cutoffs, say for stratification in running a later study, you can use the information from the full model, based on continuous predictor values, along with estimates of prevalences of combinations of predictor values, to make informed choices of where to draw cutoffs. You could evaluate how well the post-model binning works by repeating your entire process on multiple bootstrap samples from your data. Note that, in any event, it becomes increasingly difficult to get multi-dimensional bins with reasonable numbers of cases as the number of predictors increases.
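The bootstrap check described above might be sketched as follows in Python (standard library only). This is an illustration under stated assumptions, not a definitive implementation: `explained_share`, `bootstrap_optimism`, and especially `median_cuts` are hypothetical names, and `median_cuts` is only a placeholder for your entire process (fitting the continuous model, then choosing cutoffs from it). The sketch reruns that process on each bootstrap sample and compares how much variation the resulting bins explain in the bootstrap sample with how much they explain in the original data.

```python
import random
from bisect import bisect_right
from collections import defaultdict
from statistics import median

def explained_share(X, y, cuts):
    """Share of total variation in y that lies between cross-product cells."""
    cells = defaultdict(list)
    for row, yi in zip(X, y):
        key = tuple(bisect_right(c, x) for x, c in zip(row, cuts))
        cells[key].append(yi)
    grand = sum(y) / len(y)
    total = sum((v - grand) ** 2 for v in y)
    between = sum(len(v) * (sum(v) / len(v) - grand) ** 2 for v in cells.values())
    return between / total if total > 0 else 0.0

def median_cuts(X, y):
    """Placeholder for the real process: one cut per predictor at its median."""
    return [[median(col)] for col in zip(*X)]

def bootstrap_optimism(X, y, choose_cuts, n_boot=200, seed=0):
    """Average gap between in-sample and original-data explained share."""
    rng = random.Random(seed)
    n = len(y)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        cuts = choose_cuts(Xb, yb)  # rerun the whole cut-selection process
        gaps.append(explained_share(Xb, yb, cuts) - explained_share(X, y, cuts))
    return sum(gaps) / len(gaps)
```

A call such as `bootstrap_optimism(X, y, median_cuts)` returns the average gap. A gap well above zero suggests that cut-points chosen on one sample look better there than they will on other data, which is the concern raised at the start of this answer.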

But if you do ultimately have to define bins, do the model properly first, with continuous predictors, and define your bins second.

+1 for starting off by saying that coarsening a continuous predictor is probably not a good idea — Patrick Coulombe, Oct 13 '19 at 15:37
