
A decision tree (or random forest) tries to find a threshold that minimizes impurity, typically entropy or the Gini index. This assumes the feature values are

  1. totally ordered, and
  2. split by a single threshold per decision node, as opposed to, for example, two thresholds where only the middle range belongs to one class and everything outside it to another.

Obviously you can simulate multiple thresholds with multiple decision nodes, but the question is whether the greedy way of finding them will actually produce optimal or even sensible ones. That goes a bit beyond my question, though.

The second assumption is a practical one: finding multiple thresholds obviously complicates things, and you also need to rule out the trivial but useless solution where every training sample gets its own threshold. Although I know that decision trees are known to overfit anyway.
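My mental model of the threshold search is roughly the following. This is only a minimal sketch of the idea, assuming a numeric feature `x` and class labels `y`; it is not sklearn's actual implementation:

```python
# Minimal sketch of a single-threshold search: sort the feature, try every
# midpoint between consecutive distinct values as a candidate threshold, and
# keep the one with the lowest weighted Gini impurity. Not sklearn's code.
import numpy as np

def gini(labels):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Return (threshold, weighted_impurity) of the best single split on x."""
    x, y = np.asarray(x), np.asarray(y)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_imp = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical values
        t = (x[i] + x[i - 1]) / 2.0
        left, right = y[:i], y[i:]
        imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if imp < best_imp:
            best_t, best_imp = t, imp
    return best_t, best_imp
```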

Sometimes the input is not a continuous number: a feature might be categorical, for example "red", "green", "blue", "purple". In this case one can use one-hot encoding (a small sketch follows the list below). With that encoding, the 'minimize impurity' strategy used for continuous values can select a threshold of 0.5 on one of the dummy columns and thereby split off exactly one category, for example

  • all but "red", or
  • all but "green", but not
  • all but "red" or "green"

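What I currently do is roughly the following; the toy data and column names here are only for illustration:

```python
# One-hot encode the categorical feature and let the tree split on the
# resulting 0/1 dummy columns. Every split then has the form
# color_<value> <= 0.5, i.e. "this one category vs. all the rest".
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame({"color": ["red", "green", "blue", "purple", "red", "blue"]})
y = [1, 0, 0, 1, 1, 0]

X_encoded = pd.get_dummies(X, columns=["color"])  # color_blue, color_green, ...
clf = DecisionTreeClassifier(random_state=0).fit(X_encoded, y)
```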
Splitting on actual subsets larger than size 1 obviously makes finding the split exponentially harder, since there are $2^N$ possibilities for $N$ categories. Question: Nevertheless, I was wondering whether there is a better way for sklearn's DecisionTree or RandomForest to deal with categorical values.

Second question: In Azure, for example (although I don't use it), you can mark inputs as categorical and it 'deals with it'. I would guess it does so by one-hot encoding, or does Azure do something more intelligent than one-hot encoding?

PS: I looked at the DecisionTree code in sklearn, and it seems nodes are added to the tree by:

  • sorting each feature
  • iterating over the features and finding, for each, the threshold with the lowest impurity
  • selecting the feature with the lowest impurity and adding it as a decision node to the tree

The code clearly shows a threshold-type node being added:

https://github.com/scikit-learn/scikit-learn/blob/f3320a6f1ab2d060fd49c94a6e3386291a684ec7/sklearn/tree/_tree.pyx#L244

But maybe I missed something, and smarter stuff exists for categorical values?
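To make the exponential blow-up concrete, this is roughly what an exhaustive 'subset split' would look like. It is purely illustrative (this is not what sklearn does), and it examines $2^{N-1}-1$ partitions:

```python
# Exhaustive subset split for a single categorical feature: enumerate every
# non-trivial partition of the categories and keep the one with the lowest
# weighted Gini impurity. Illustrative only -- NOT sklearn's implementation.
from itertools import combinations
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_subset_split(x, y):
    """x: array of category labels, y: class labels."""
    x, y = np.asarray(x), np.asarray(y)
    cats = list(np.unique(x))
    best_subset, best_imp = None, np.inf
    # Pin the first category to the left branch so each partition is counted once.
    for r in range(len(cats) - 1):
        for extra in combinations(cats[1:], r):
            left = {cats[0], *extra}
            mask = np.isin(x, list(left))
            imp = (mask.sum() * gini(y[mask]) + (~mask).sum() * gini(y[~mask])) / len(y)
            if imp < best_imp:
                best_subset, best_imp = left, imp
    return best_subset, best_imp
```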

Herbert
  • If your categorical variable has few different levels, e.g., "red", "green", "blue", "purple", then it would be quite feasible to put everything "red" in the left branch, then try all $2^3=8$ possible ways of adding everything of zero, one, two or three specific colors to that left branch, and finally evaluate the purity gain. Are you sure no implementation tries something like this? – Stephan Kolassa Nov 06 '17 at 09:53
  • @StephanKolassa Basically that is my question: are more complicated things than leave-one-category-out splits done in general, for example in sklearn or Azure? Moreover, your suggestion would not scale well. – Herbert Nov 06 '17 at 15:19
  • See my answer [here](https://stats.stackexchange.com/questions/398903/strange-encoding-for-categorical-features/414892#414892) for some ideas. – kjetil b halvorsen Jul 25 '19 at 16:45
  • Also see https://stats.stackexchange.com/questions/231285/dropping-one-of-the-columns-when-using-one-hot-encoding/329281#329281, https://stats.stackexchange.com/questions/390671/random-forest-regression-with-sparse-data-in-python/430127#430127, https://stats.stackexchange.com/questions/410939/label-encoding-vs-dummy-variable-one-hot-encoding-correctness/414729#414729 – kjetil b halvorsen Jan 25 '20 at 20:28
