What is the justification for unsupervised discretization of continuous variables?

Question

A number of sources suggest that there are many negative consequences of the discretization (categorization) of continuous variables prior to statistical analysis (sample of references [1]-[4] below).

Conversely, [5] suggests that some machine learning techniques are known to produce better results when continuous variables are discretized (also noting that supervised discretization methods perform better).

Are there any widely accepted benefits or justifications for this practice from a statistical perspective?

In particular, would there be any justification for discretizing continuous variables within a GLM analysis?

[1] Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41

[2] Brunner J, Austin PC. Inflation of Type I error rate in multiple regression when independent variables are measured with error. The Canadian Journal of Statistics 2009; 37(1):33-46

[3] Irwin JR, McClelland GH. Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research 2003; 40:366–371.

[4] Harrell Jr FE. Problems caused by categorizing continuous variables. http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous, 2004. Accessed on 6.9.2004

[5] Kotsiantis, S.; Kanellopoulos, D. "Discretization Techniques: A recent survey". GESTS International Transactions on Computer Science and Engineering 32(1):47–58.

Discretizing them compared to doing what else? If the alternative's considered to be treating the relation between predictor & response as linear then it's not surprising that discretization can sometimes give a better fit. See [here](http://stats.stackexchange.com/questions/68834/). — Scortchi - Reinstate Monica, Jun 23 '14 at 16:01

score 8 · Accepted Answer · answered Jun 24 '14 at 20:28

The purpose of statistical models is to model (approximate) an unknown, underlying reality. When you discretize something that is naturally continuous, you are saying that the response is exactly the same across a whole range of predictor values, and that there is then a sudden jump at the next interval. Do you really believe that the natural world works by having a large difference in the response between x-values of 9.999 and 10.001 while having no difference between 9.001 and 9.999 (assuming one of the intervals is 9-10)? I cannot think of any natural processes that I would consider plausibly working that way.

Now, there are many natural processes that act in a nonlinear manner: a change from 8 to 9 in the predictor may produce a very different change in the response than a change from 10 to 11. A discretized predictor may therefore fit better than a linear relationship, but that is because it is allowed more degrees of freedom. There are other ways to allow additional degrees of freedom, such as polynomials or splines, and these options allow us to penalize to get a certain level of smoothness and maintain something that is a better approximation of the underlying natural process.
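
As a rough illustration of the degrees-of-freedom point, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the sine-shaped "true" relationship, the noise level, the six equal-width bins, and the choice of a degree-5 polynomial as the smooth alternative with the same number of parameters. It compares how closely each fit recovers the true curve.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Simulated data: the sine curve and the noise level are assumptions
# chosen purely for illustration.
def truth(x):
    return np.sin(x)

n, n_params = 600, 6
x = rng.uniform(0.0, 2.0 * np.pi, size=n)
y = truth(x) + rng.normal(scale=0.3, size=n)

# 1) Straight line (2 parameters).
linear = Polynomial.fit(x, y, deg=1)

# 2) Equal-width bins: one fitted mean per bin (6 parameters).
edges = np.linspace(0.0, 2.0 * np.pi, n_params + 1)

def bin_index(values):
    # Only interior edges are passed, so indices run from 0 to n_params - 1.
    return np.digitize(values, edges[1:-1])

bin_means = (np.bincount(bin_index(x), weights=y, minlength=n_params)
             / np.bincount(bin_index(x), minlength=n_params))

def binned(values):
    # Step function: every x in a bin gets that bin's mean response.
    return bin_means[bin_index(values)]

# 3) Degree-5 polynomial (also 6 parameters), a smooth alternative.
poly = Polynomial.fit(x, y, deg=n_params - 1)

# Compare each fit with the true curve on a fine grid.
grid = np.linspace(0.0, 2.0 * np.pi, 1000)

def rmse(fit):
    return float(np.sqrt(np.mean((fit(grid) - truth(grid)) ** 2)))

rmse_by_model = {
    "linear": rmse(linear),
    "binned": rmse(binned),
    "polynomial": rmse(poly),
}
for name, value in rmse_by_model.items():
    print(f"{name:>10}: RMSE vs. true curve = {value:.3f}")
```

In this particular setup the binned fit should beat the straight line, and the smooth fit with the same number of parameters should beat both. A different true curve or bin count could change the size of the gaps, so treat it as a sketch rather than a general result.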

Youloush · Answer 2 · score 2 · 2014-06-25T11:00:14.247

Edit: Because of the trend I'm seeing in the other answers, a short disclaimer: my answer is motivated by a machine learning perspective, not a statistical modelling one.

Some models, such as Naive Bayes, do not function with continuous features. Discretizing the features can help them perform (much) better. Generally, models that do not rely on the "numerical" character of the feature (decision trees come to mind) are not affected much, as long as the discretization is not too aggressive. Some other models, however, will underperform badly if the discretization is too heavy. For example, GLMs will gain absolutely no benefit from the process.
In some cases, when memory or processing time becomes a limiting factor, feature discretization makes it possible to aggregate a dataset, reducing its size and its memory and computing-time consumption.
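
The aggregation idea can be sketched as follows. This is a minimal example on simulated data. The feature distribution, the binary label, the 20 quantile bins, and all variable names are assumptions made for illustration, not something taken from a particular library workflow.

```python
import numpy as np

rng = np.random.default_rng(1)

n_rows, n_bins = 100_000, 20

# Simulated continuous feature and a hypothetical binary label whose
# probability depends on the feature (illustrative assumption).
x = rng.normal(size=n_rows)
label = (rng.uniform(size=n_rows) < 1.0 / (1.0 + np.exp(-x))).astype(int)

# Unsupervised discretization: equal-frequency (quantile) bins.
edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
bin_id = np.digitize(x, edges[1:-1])  # values in 0 .. n_bins - 1

# Aggregate: one count per (bin, label) cell instead of one row per case.
counts = np.zeros((n_bins, 2), dtype=np.int64)
np.add.at(counts, (bin_id, label), 1)

print(f"original rows:    {n_rows}")
print(f"aggregated cells: {counts.size}")
```

A count table like this is all that a frequency-based classifier needs from the feature: for instance, `counts / counts.sum(axis=0)` gives the per-class bin frequencies. The price is that everything about `x` within a bin is discarded.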

So the bottom line is that if you are not computationally limited, and if your model does not absolutely require discrete features, do not run feature discretization. Otherwise, by all means consider it.

edited Jun 25 '14 at 11:00

answered Jun 23 '14 at 16:41

Youloush

A method that does not use the numerical nature of the variable should be avoided at any rate. – Frank Harrell Jun 24 '14 at 19:37

That is plain false. Decision trees, Random Forests, Gradient Boosted DT are all excellent algorithms and do not take the numerical nature of the variables into account, except for their ordering. Naive Bayes can often be a more than sufficient tool for basic classification tasks. – Youloush Jun 25 '14 at 08:26

There are several misunderstandings. First you assume that discretization at least uses the ordinal nature of continuous predictors; it does not. Then you confuse pre-binning (a disaster) with binning during the predictive algorithm (a small disaster). You assume that classification leads to optimum decisions as opposed to prediction. You assume that categorization of inputs is the way to go, as opposed to categorization of outputs (predicted risk, then apply loss function to get optimum decision). Finally, you imply it is OK to make true smooth relationships discontinuous. – Frank Harrell Jun 25 '14 at 13:16
Discretization of a continuous feature does usually use its ordinal nature (such as the methods in the OP's references). Second, I have no idea what you're rambling about, opposing classification, prediction or "decision". I have cited one example of an algorithm (Naive Bayes) which benefits from feature discretization, and happens to be used for classification. Finally, "smooth relationships" do not actually exist in actual "real" datasets. You are merely modelling them and finding the best fit. – Youloush Jun 25 '14 at 22:09
I'll end this comment war (at least on my side) by adding that this is basically the age-old argument between statisticians and machine learning scientists, with ML seeking to minimize prediction error and statisticians trying to build a model for an "underlying reality", cf. Greg's answer. – Youloush Jun 25 '14 at 22:13

Since prediction error is an improper accuracy scoring rule, that statement says a lot about ML. And I don't know of many statisticians who really seek an underlying reality. We are content to develop various approximations or stand-ins for the reality, as well as just plain letting the data speak for themselves. – Frank Harrell Jun 25 '14 at 22:31

P.S. Smooth relationships exist as an underlying truth in almost all datasets not containing time as the sole predictor. Obviously, data points are discrete. That has *absolutely* nothing to do with whether you choose a smooth modeling approach or not. – Frank Harrell Jun 25 '14 at 22:32
Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/15344/discussion-between-youloush-and-frank-harrell). – Youloush Jun 25 '14 at 23:20
