- When should one consider converting a continuous variable into a categorical variable? Are there guidelines? Is it justified to bin a skewed variable?
- How should I determine the ranges (bins) when I do the conversion? For example, if I have annual income as a predictor, how do I decide on bins such as 20k to 50k, 51k to 80k, and so on? Do the bins have to be equally spaced (fixed width)?
- Is binning known to improve models like Random Forest?
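To make the second question concrete, here is a minimal sketch of the two most common ways of choosing cut points: equal-width bins and equal-frequency (quantile) bins. The income values and the helper names `equal_width_edges`, `equal_frequency_edges`, and `bin_counts` are hypothetical, made up for illustration. The sketch only shows how the schemes differ on skewed data and does not imply that either one is advisable.

```python
import statistics
from bisect import bisect_right

def equal_width_edges(values, n_bins):
    """Return n_bins + 1 edges splitting [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def equal_frequency_edges(values, n_bins):
    """Return n_bins + 1 edges so that each bin holds roughly the same number of points."""
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    return [min(values)] + cuts + [max(values)]

def bin_counts(values, edges):
    """Count how many values fall into each bin defined by `edges`."""
    counts = [0] * (len(edges) - 1)
    interior = edges[1:-1]  # the outer edges are just min and max
    for v in values:
        counts[bisect_right(interior, v)] += 1
    return counts

# Hypothetical right-skewed annual incomes, in thousands.
incomes = [22, 25, 28, 30, 31, 35, 38, 40, 45, 52, 60, 75, 120, 250]

width_edges = equal_width_edges(incomes, 4)
freq_edges = equal_frequency_edges(incomes, 4)

print(bin_counts(incomes, width_edges))  # [12, 1, 0, 1]
print(bin_counts(incomes, freq_edges))   # [4, 3, 3, 4]
```

On this skewed sample, fixed-width bins put almost every observation into the first bin and leave one bin empty, while quantile bins are balanced in count but very unequal in width. Either way, the cut points are a choice made by the analyst rather than something the data dictate.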

learner
- 1) When one is demonstrating poor practices, and perhaps not in any other situation. – Dave Jul 05 '20 at 12:12
- @Dave: Let's say I have age as a predictor variable. A 22-year-old subject is probably not very different from a 23-year-old subject if I were to consider the impact of age, and similarly for an annual income of 70k versus 72k. Is it not better to put them in ranges to be able to explain the results better? (Pardon my ignorance.) – learner Jul 05 '20 at 12:17
- 1. Never. 2. It's arbitrary. 3. Some models, such as logistic regression, are better with categorical variables, but by categorizing to begin with, you've already lost. See 1. – Robert Long Jul 05 '20 at 12:20
- It's rarely a good idea to categorize a continuous predictor. One scenario where it makes sense is when the categorization encodes knowledge that the analyst has. Examples (depending on whether they are relevant to what you are modeling): Is a person younger or older than the legal limit for buying alcohol? Is a person old enough to be covered under some health insurance scheme for the elderly? Does this income fall into a certain income tax bracket? – Björn Jul 05 '20 at 12:29
- @Björn good example. Despite my slightly flippant comment above, the correct answer is "very rarely". Another example where it would be appropriate is in data protection / privacy, as part of pseudonymization. – Robert Long Jul 05 '20 at 12:35
- @Björn: In addition to your examples, could a skewed variable be a justified candidate for discretization? Let's say the candidates in my survey typically belong to a certain (young) age range, although some portion of them are older too. – learner Jul 05 '20 at 14:14
- Not sure why that would make categorization more or less useful. By the way, in the examples I gave, I would also still usually add the non-categorized variable to the model. – Björn Jul 05 '20 at 21:11
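The points raised in the comments can be sketched in a few lines. The cut points in `AGE_EDGES`, the threshold `LEGAL_AGE`, and the helpers `age_bin` and `age_features` are all assumptions invented for this illustration; none of them come from the thread. The first half shows what binning does to neighbouring ages. The second half shows one way to read Björn's suggestion: keep the raw variable and add a knowledge-based indicator next to it.

```python
from bisect import bisect_right

# Hypothetical cut points for an age variable (an assumption, not from the thread).
AGE_EDGES = [20, 30, 40, 50]
# Assumed legal threshold; the real value depends on the jurisdiction.
LEGAL_AGE = 21

def age_bin(age):
    """Index of the bin that `age` falls into (0 means below the first edge)."""
    return bisect_right(AGE_EDGES, age)

def age_features(age):
    """Keep the raw age and add a knowledge-based indicator alongside it."""
    return {"age": age, "of_legal_age": int(age >= LEGAL_AGE)}

# Binning makes 22 and 23 identical, as the asker intends...
same_bin_close = age_bin(22) == age_bin(23)    # True
# ...but it also makes 20 and 29 identical while separating 29 from 30.
same_bin_far = age_bin(20) == age_bin(29)      # True
split_neighbours = age_bin(29) != age_bin(30)  # True

rows = [age_features(a) for a in (20, 21, 35)]
```

With these cut points a model sees no difference between a 20-year-old and a 29-year-old, yet a jump between 29 and 30. A model given both `age` and `of_legal_age` can still use the full ordering of ages and, in addition, fit a step at the threshold if the data support one.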