A decision tree (or random forest) tries to find a threshold that minimizes impurity, typically entropy or gini. This assumes the values are
- totally ordered and
- one decision has one threshold, as opposed to for example two thresholds where only the middle part belongs to one class, and the outside to another.
Obviously you can simulate multiple thresholds by multiple decision nodes, but the question is whether the greedy way of finding them will actually find optimal or even sensible ones. But that goes beyond my question a bit.
The second assumption is a practical one, because obviously it complicates stuff if you need to find multiple thresholds, and you also need to deal with the trivial but useless solution where every training sample has a threshold. Although I know that decision trees as known to overfit.
Sometimes input is not a continuous number, as a feature might be categorical, for example "red", "green", "blue", "purple". In this case, one can use one hot encoding. In general, the 'minimize impurity' strategy used for continuous values could select a threshold of 0.5 and discriminate by exactly one class, for example
all but "red"
, orall but "green"
, but notall but "red" or "green"
Discriminating actual subsets, larger than size 1, obviously exponentially complicates finding this split, since there are $2^N$ possibilities for $N$ categories. Question: Nevertheless, I was wondering whether there is a better way for sklearn
DecisionTree
's or RandomForest
's to deal with categorical values.
Second question: In Azure for example (although I don't use is), you can mark categorical inputs, and it 'deals with it', I would guess by one-hot-encoding it, or does Azure do more intelligent things than one-hot-ending?
PS: I looked to the DecisionTree
-code for sklearn
, and it seems nodes are added to the tree by:
- sorting each feature
- iterating over features, and finding the threshold with lowest impurity
- selecting the features with lowest impurity and adding that as a node to the tree
The code clearly show a threshold-type of node being added:
But maybe I missed something, and smarter stuff exists for categorical values?