How does a decision tree split on a categorical variable?

Question

Some implementations of decision trees (e.g. the tree package on CRAN) can split on categorical variables, where the split separates the variable's levels into two groups:

> library("tree")
> df <- read.csv('https://forge.scilab.org/index.php/p/rdataset/source/file/master/csv/datasets/Titanic.csv')

> tree(Survived ~ Class, data=df, weights=Freq, minsize=10)
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 2201 2769 No ( 0.6770 0.3230 )  
  2) Class: 3rd,Crew 1591 1772 No ( 0.7549 0.2451 ) *
  3) Class: 1st,2nd 610  844 Yes ( 0.4738 0.5262 ) *

For a numerical variable, the split is found by trying every possible split point between the minimum and maximum values of the variable.
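The numerical case can be sketched in R as follows. This is only an illustration, not code from the tree package: `candidate_thresholds` is a hypothetical helper, and it assumes that only the midpoints between consecutive distinct observed values need to be checked, since any two thresholds lying between the same pair of neighbouring values produce the same partition of the data.

```r
# Hypothetical helper (not part of the tree package): candidate split points
# for a numeric predictor, taken as midpoints between neighbouring distinct values.
candidate_thresholds <- function(x) {
  v <- sort(unique(x))                  # distinct observed values, ascending
  if (length(v) < 2) return(numeric(0)) # a constant variable cannot be split
  (head(v, -1) + tail(v, -1)) / 2       # midpoint of each adjacent pair
}

candidate_thresholds(c(3, 1, 2, 2, 10))
# [1] 1.5 2.5 6.5
```

A variable with m distinct values therefore gives at most m - 1 candidate splits, so the search grows only linearly in the number of distinct values.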

How are the splits enumerated for a categorical variable? Do you have to try all combinations of values? (This would seem to be computationally infeasible.) Or is there a heuristic that limits the number of combinations tried?
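To make the brute-force option concrete, here is a sketch in R of what "trying all combinations" would mean. The function `binary_partitions` is a hypothetical helper written for this question, not something taken from the tree package, and it says nothing about what tree actually does internally. It lists every way of dividing the levels into two non-empty groups; for k levels there are 2^(k-1) - 1 of them.

```r
# Hypothetical helper: every division of a set of levels into two non-empty
# groups. The first level is always kept on the left so that a partition and
# its mirror image are not counted twice.
binary_partitions <- function(levels) {
  k <- length(levels)
  stopifnot(k >= 2)
  rest <- levels[-1]
  out <- list()
  # mask runs over 0 .. 2^(k-1) - 2; bit i says whether rest[i] joins the left
  # group. The all-ones mask is excluded because it would leave the right empty.
  for (mask in 0:(2^(k - 1) - 2)) {
    in_left <- (mask %/% 2^(seq_along(rest) - 1)) %% 2 == 1
    out[[length(out) + 1]] <- list(left  = c(levels[1], rest[in_left]),
                                   right = rest[!in_left])
  }
  out
}

parts <- binary_partitions(c("1st", "2nd", "3rd", "Crew"))
length(parts)
# [1] 7
```

With the four levels of Class there are only 7 candidates, one of which is the {1st, 2nd} versus {3rd, Crew} split shown in the output above. The count roughly doubles with each extra level, though (511 for 10 levels, 524287 for 20), which is why exhaustive enumeration looks impractical for factors with many levels.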

To my understanding, Weka's J48 is supposed to be an implementation of C4.5, which is contrasted with CART in several ways, as described in [Top 10 algorithms in data mining](http://www.cs.umd.edu/~samir/498/10Algorithms-08.pdf) (which has Quinlan as a coauthor, so one would guess that it's accurate about the distinctions) — Glen_b, Jun 06 '17 at 00:02
Thanks for the link @Glen_b. In the article they say "An attribute A with discrete values has by default one outcome for each value, but an option allows the values to be grouped into two or more subsets with one outcome for each subset." But they don't say how this grouping is achieved, which is what I'm asking in my question. — Max Flander, Jun 06 '17 at 00:11
BTW, I've removed CART from the title, if that makes things clearer. — Max Flander, Jun 06 '17 at 00:15
[This image](http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/classify/figure24-b.gif) shows that you can get more than a binary split (see the region variable); equivalently, see the outlook variable on p22 of [this](http://csed.sggs.ac.in/csed/sites/default/files/WEKA%20Explorer%20Tutorial.pdf), and also p7 [here](https://moodle.umons.ac.be/pluginfile.php/43703/mod_resource/content/2/WekaTutorial.pdf). If your question is "how do I make Weka do this?", it will be [off topic here](http://stats.stackexchange.com/help/on-topic) (see under *Programming*). — Glen_b, Jun 06 '17 at 00:19
This document (again, Quinlan is a coauthor): [*Decision Tree Discovery*](http://ai.stanford.edu/~ronnyk/treesHB.pdf) under "Candidate tests" (p3) gives a brief discussion of how C4.5 deals with categorical variables. If j48 implements C4.5 correctly (you can always check the code), presumably it works like this. — Glen_b, Jun 06 '17 at 00:35
@Glen_b, I'm having trouble understanding that explanation. Is it talking about grouping, or about n-ary splits? My question is about binary splits with grouped values; I've added some code to my question that hopefully makes it a bit clearer. Thanks for your patience with me on this! — Max Flander, Jun 06 '17 at 00:57
