This is a broad topic, and you will encounter a range of reasons why data should be, or already is, bucketized. Not all of them relate to predictive accuracy.
First, here's an example where a modeler may want to bucketize. Suppose I'm building a credit scoring model: I want to know people's propensity to default on a loan. In my data, I have a column indicating the status of a credit report. That is, I ordered the report from a rating agency, and the agency returned, say, their proprietary score, along with a categorical variable indicating the reliability of this score. This indicator may be much more finely grained than I need for my purposes. For example, the "not enough information for a reliable score" status may be broken out into many classes like "less than 20 years of age", "recently moved to the country", "no prior credit history", etc. Many of these classes may be sparsely populated, and hence rather useless in a regression or other model. To deal with this, I may want to pool similar classes together to consolidate the statistical power into a "representative" class. For example, it may only be reasonable for me to use a binary indicator: "good information returned" vs. "no information returned". In my experience, many applications of bucketization fall into this general type: collapsing sparsely populated categories.
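To make this concrete, here is a minimal sketch of that pooling in pandas. The `report_status` column and the reason labels are hypothetical stand-ins for whatever the rating agency actually returns:

```python
import pandas as pd

# Hypothetical credit-report status column with sparse "no information" reasons.
df = pd.DataFrame({
    "report_status": [
        "score_returned", "score_returned", "under_20",
        "recent_immigrant", "no_credit_history", "score_returned",
    ]
})

# Pool every sparse "no information" reason into one representative class,
# leaving a binary indicator: good information vs. no information.
no_info_reasons = {"under_20", "recent_immigrant", "no_credit_history"}
df["report_status_pooled"] = df["report_status"].where(
    ~df["report_status"].isin(no_info_reasons), "no_information_returned"
)

print(df["report_status_pooled"].value_counts())
```

The same pattern works when the pooling rule is frequency-based rather than domain-based: replace the hand-picked set with the labels whose counts fall below some threshold.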
Some algorithms use bucketization internally. For example, the trees fit inside boosting algorithms often spend the majority of their time in a summarization step, where the continuous data in each node is discretized and the mean value of the response in each bucket is calculated. This greatly reduces the computational complexity of finding an appropriate split, without much sacrifice in accuracy, since subsequent boosting stages can correct for the coarseness of any one tree.
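Here is a toy sketch of that summarization step, roughly in the spirit of the histogram-based split finding used by libraries like LightGBM or XGBoost's hist mode; the function and variable names are mine, not any library's:

```python
import numpy as np

def best_split_via_histogram(x, y, n_bins=16):
    """Discretize x into bins, then scan bin boundaries as candidate splits."""
    # Bin edges from quantiles, so each bucket holds roughly equal mass.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)  # bucket index for every row

    # Per-bucket sufficient statistics: count and sum of the response.
    counts = np.bincount(bins, minlength=n_bins)
    sums = np.bincount(bins, weights=y, minlength=n_bins)

    # Scan the n_bins - 1 boundaries instead of all n - 1 raw thresholds.
    best_gain, best_bin = -np.inf, None
    total_n, total_s = counts.sum(), sums.sum()
    left_n = left_s = 0.0
    for b in range(n_bins - 1):
        left_n += counts[b]
        left_s += sums[b]
        right_n, right_s = total_n - left_n, total_s - left_s
        if left_n == 0 or right_n == 0:
            continue
        # Reduction in squared error from splitting at this boundary.
        gain = left_s**2 / left_n + right_s**2 / right_n - total_s**2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Scanning 15 bucket boundaries instead of thousands of raw thresholds is where the speedup comes from.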
You may also simply receive data pre-bucketized. Discrete data is easier to compress and store - a long array of floating point numbers is nigh incompressible, but when discretized into "high", "medium", and "low", you can save a lot of space in your database. Your data may also come from a source targeted at a non-modeling application. This tends to happen a lot when I receive data from organizations that do less analytical work. Their data is often used for reporting, and is summarized to a high level to help with the interpretability of the reports to laymen. This data can still be useful, but often some power is lost.
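The compression claim is easy to check for yourself; a quick illustration (the thresholds and sizes here are arbitrary, not a benchmark):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

raw = x.tobytes()
# Coarsen to three labels: low / medium / high.
labels = np.digitize(x, [-0.5, 0.5]).astype(np.uint8)

print(len(zlib.compress(raw)) / len(raw))               # close to 1.0
print(len(zlib.compress(labels.tobytes())) / len(raw))  # a tiny fraction
```

The random mantissas in the floats leave almost nothing for the compressor to exploit, while a stream drawn from three symbols compresses down to its (low) entropy.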
What I see less value in, though it's possible I may be corrected, is the pre-bucketization of continuous measurements for modeling purposes. There are plenty of very powerful methods for fitting non-linear effects to continuous predictors, and bucketization removes your ability to use these. I tend to see this as a bad practice.
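As an illustration of what gets thrown away, here is a hedged sketch contrasting a spline fit against the same predictor pre-bucketized into four bins; scikit-learn is assumed, and the simulated data is mine:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, SplineTransformer

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# Smooth non-linear effect via a spline basis.
spline_model = make_pipeline(SplineTransformer(n_knots=8), LinearRegression())
# The same data, but pre-bucketized into four coarse bins.
binned_model = make_pipeline(
    KBinsDiscretizer(n_bins=4, encode="onehot-dense"), LinearRegression()
)

for name, model in [("spline", spline_model), ("binned", binned_model)]:
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))
```

The spline can track the smooth sine curve; the four-bin model is stuck with a step function, and once the bucketization has happened upstream, no amount of downstream modeling gets the lost resolution back.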