Does binning of ranges make sense for a Random Forest?

Question

I'm looking at example solutions to Kaggle's Titanic competition. In short: Given passenger information such as age, sex, fare, class, can you predict whether or not they survived?

A lot of people like to preprocess the data by binning age and fare into ranges.

But then they throw a Random Forest at the problem, and I feel like binning is a waste of information here, because when building a decision tree, the algorithm kinda does the binning for you, based on the best split.

Is that intuition correct, or does the simplification achieved by binning make up for the loss of detailed information about the distribution?

score 2 · Answer 1 · answered Feb 27 '18 at 06:45

I agree with you that binning doesn't make much sense in the general case. The random forest itself already performs binning of the input space (as you mentioned), and it does so adaptively (i.e. it picks the split thresholds that best separate the training data). Manual binning destroys information and forces trees to split at pre-defined bin edges, which might be suboptimal. I could see a use for it if there were a priori knowledge that a particular binning was meaningful/good. It could also serve as an approximation strategy to save time/space on very large datasets: quantized values require fewer bits to store, and there are fewer candidate thresholds to search through when splitting a node.
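To make the bin-edge point concrete, here is a minimal sketch in plain Python. It is not a random forest: it searches for the best single split on one feature and scores it by misclassification count (real implementations typically use Gini or entropy). The data is synthetic, and the age cutoff of 13, the 10-year bins, and the helper name best_split are all made up for illustration.

```python
def best_split(values, labels):
    """Find the single threshold that minimises misclassifications.

    Returns (threshold, errors, number_of_candidate_thresholds).
    Each side of the split predicts its own majority class.
    """
    uniq = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    best = None
    for t in candidates:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Errors on each side = size of the minority class on that side.
        errors = (min(sum(left), len(left) - sum(left))
                  + min(sum(right), len(right) - sum(right)))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0], best[1], len(candidates)


# Synthetic data (assumption): the label flips exactly at age 13.
ages = list(range(1, 61))
labels = [1 if a < 13 else 0 for a in ages]

# Manual preprocessing: replace each age by the lower edge of its 10-year bin.
binned = [a // 10 * 10 for a in ages]

raw_t, raw_err, raw_n = best_split(ages, labels)
bin_t, bin_err, bin_n = best_split(binned, labels)

print(raw_t, raw_err, raw_n)  # 12.5 0 59
print(bin_t, bin_err, bin_n)  # 5.0 3 6
```

On the raw ages the search recovers the true cutoff with no errors, while on the binned ages the best available edge misclassifies three passengers. The binned search only has to evaluate 6 candidate thresholds instead of 59, which is the time/space trade-off mentioned above.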
