
I am currently competing in a Kaggle competition and am wondering whether it is a good idea to discretize certain variables.

I noticed that some continuous variables have almost all of their values in a form that is easy to discretize - 0.255, 0.125, 0.005 - so in this case I can multiply the variable by 200 and get values in the range [0..200].

But statistically, can I benefit from doing this, or does it just make computation easier because the model will work with integers? Also, a small percentage of the data (e.g. < 5%) doesn't round to an integer when multiplied by 200... Does losing precision necessarily mean losing quality in the data?
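For reference, this is roughly how I check how many values survive the conversion exactly (a minimal sketch; the series `x` below is just an illustrative stand-in for the actual competition feature):

```python
import numpy as np
import pandas as pd

# Illustrative values only; in practice this would be the competition column.
x = pd.Series([0.255, 0.125, 0.005, 0.1337, 0.02])

scaled = x * 200                              # maps most values onto integers in [0, 200]
is_exact = np.isclose(scaled, scaled.round()) # which values round cleanly

print(f"Values that survive the conversion exactly: {is_exact.mean():.0%}")
print(f"Largest rounding error introduced: {(scaled - scaled.round()).abs().max():.4f}")
```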

If not, why aren't all discrete variables scaled to [0..1] (efficiency aside)?

Kristijan

1 Answer


I noticed that some continuous variables have almost all of their values in a form that is easy to discretize - 0.255, 0.125, 0.005 - so in this case I can multiply the variable by 200 and get values in the range [0..200].

Multiplying anything by a constant won't make it discrete. Discrete random variables take distinct, countable values; it does not matter how those values are coded, and you don't have to store them as integers (e.g. you could use names for each category). Continuous random variables can take an infinite number of possible values (e.g. distance, or elapsed time). See also the Should types of data (nominal/ordinal/interval/ratio) really be considered types of variables? thread.
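To illustrate, rescaling by a constant changes the units but not the statistical content of the variable; here is a minimal sketch with made-up data (the names are illustrative, not from your competition):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical continuous feature and target, just for illustration.
x = rng.uniform(0, 1, size=1000)
y = 3 * x + rng.normal(size=1000)

# Multiplying by 200 changes the scale, not the information content:
# the correlation with the target is identical.
print(np.corrcoef(x, y)[0, 1])
print(np.corrcoef(x * 200, y)[0, 1])
```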

But statistically, can I benefit from doing this, or does it just make computation easier because the model will work with integers? Also, a small percentage of the data (e.g. < 5%) doesn't round to an integer when multiplied by 200... Does losing precision necessarily mean losing quality in the data?

There is no benefit whatsoever, because you can apply the same mathematical operations to integers and real numbers. However, real discretization, i.e. converting a continuous variable into a discrete one by mapping its values to a countable number of categories (e.g. converting age into "young" and "old" groups), would make a difference. Imagine you measured human age and put everyone aged 0-21 years into a "young" group and everyone aged 22 or more into an "old" group (or used any other arbitrary cutoff). That amounts to saying that there is a difference between 21- and 22-year-olds (different categories), but no difference between 22- and 23-year-olds (same category). See the What is the benefit of breaking up a continuous predictor variable? thread to learn more.
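As a minimal sketch of that kind of discretization (hypothetical ages, binned with pandas' `pd.cut`):

```python
import pandas as pd

# Hypothetical ages around the arbitrary cutoff used above.
age = pd.Series([20, 21, 22, 23, 60])

# Bin into "young" (0-21) and "old" (22+); the cutoff is arbitrary.
group = pd.cut(age, bins=[0, 21, 120], labels=["young", "old"])

print(pd.DataFrame({"age": age, "group": group}))
# 21 and 22 land in different categories, while 22, 23 and even 60
# become indistinguishable -- that is the information lost by discretizing.
```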

If not, why aren't all discrete variables scaled to [0..1] (efficiency aside)?

There are multiple reasons; I'll name the two that seem most obvious. First, if a variable can take values from $-\infty$ to $\infty$, then you simply cannot convert it into a bounded variable. Second, would it be at all informative if you learned that someone's age is 0.235? You might ask, "0.235 compared to what?", and that would be a valid question. In most cases we want our variables to be easily interpretable. Feature scaling, normalization, and other transformations (e.g. taking logs, squaring, taking square roots, absolute values, etc.) are often used with continuous random variables, but only when there is a reason for it (see Variables are often adjusted (e.g. standardised) before making a model - when is this a good idea, and when is it a bad one?).
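For completeness, a minimal sketch of the kinds of transformations mentioned above, on made-up data with scikit-learn (all names here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.lognormal(mean=3, sigma=1, size=(1000, 1))  # hypothetical right-skewed feature

x_minmax = MinMaxScaler().fit_transform(x)    # squeezes values into [0, 1]
x_standard = StandardScaler().fit_transform(x)  # zero mean, unit variance
x_log = np.log(x)                               # reduces the right skew

# These transformations are reversible, so no information is lost --
# but a value like 0.235 on the min-max scale is hard to interpret on its own.
print(x_minmax.min(), x_minmax.max())
print(x_standard.mean().round(3), x_standard.std().round(3))
```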

Tim