I'm not sure of the best way to explain this, so let me give an example that motivates my question. I've tried reading the RPART manual, documentation, and source code, but I haven't been able to resolve this.
Let's build two classification trees from the iris data: one using only 2 predictors (Sepal.Width & Petal.Width), and the other using all predictors. Some things to examine in the code and results below:
- The primary splitters are different in each tree
- The cp tables have different xerror and xstd values for the first split
- The classification summaries are the same (i.e., the same class probabilities and the same observations in each node)
- Manual calculations (not shown, but can be made available) show that the primary splitters in each tree have the same Gini information gain of 0.333
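That 0.333 figure can be reproduced directly from the node counts shown in the rpart output below; here is a minimal sketch in Python (the `gini` helper is mine for illustration, not rpart's code):

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Class counts (setosa, versicolor, virginica) read off the rpart output below.
root = [50, 50, 50]                    # root node: 50 of each species
left, right = [50, 0, 0], [0, 50, 50]  # either candidate split yields these children

n = sum(root)
gain = gini(root) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)
print(round(gain, 3))  # 0.333
```

Both `Petal.Width < 0.8` and `Petal.Length < 2.45` isolate the 50 setosa observations, so they produce identical child counts and hence identical gain.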
When growing the tree, the algorithm is greedy (it doesn't look ahead). If the Gini information gains are the same for these primary splitters, how does RPART pick the "winner" to continue building the tree? My understanding is that the xerror values come from cross-validation, which occurs after the maximal tree is built, so it doesn't make sense to compute xerror when first determining the splitters.
Edit: Ignoring RPART, how should this situation be handled theoretically? Does it even matter which splitter is picked when they both have maximal information gain?
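To make the tie-breaking question concrete, here is a toy greedy search loop of my own (not rpart's actual C code; whether rpart uses a strict comparison and scan order to break ties is an assumption here). With a strict `>`, the first candidate scanned wins a tie:

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def best_split(root, candidates):
    """Greedy pick: keep the first candidate with strictly better gain,
    so exact ties are broken by scan order."""
    n = sum(root)
    best_name, best_gain = None, -1.0
    for name, (left, right) in candidates:
        gain = (gini(root)
                - (sum(left) / n) * gini(left)
                - (sum(right) / n) * gini(right))
        if gain > best_gain:  # strict '>' keeps the earlier candidate on a tie
            best_name, best_gain = name, gain
    return best_name, best_gain

# Both root candidates isolate the 50 setosa (counts from the output below).
root = [50, 50, 50]
candidates = [("Petal.Length < 2.45", ([50, 0, 0], [0, 50, 50])),
              ("Petal.Width < 0.8",   ([50, 0, 0], [0, 50, 50]))]
name, gain = best_split(root, candidates)
print(name, round(gain, 3))  # Petal.Length < 2.45 0.333
```

Note that either choice produces identical children here, so the fitted tree's resubstitution error is unaffected; only the label of the split differs.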
Code and Results
(1) Model with Sepal.Width & Petal.Width
> iris.rpart2d = rpart(iris$Species ~ iris$Sepal.Width + iris$Petal.Width, data=iris, method="class")
> iris.rpart2d
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
2) iris$Petal.Width< 0.8 50 0 setosa (1.00000000 0.00000000 0.00000000) *
3) iris$Petal.Width>=0.8 100 50 versicolor (0.00000000 0.50000000 0.50000000)
6) iris$Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
7) iris$Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
> printcp(iris.rpart2d)
Classification tree:
rpart(formula = iris$Species ~ iris$Sepal.Width + iris$Petal.Width,
data = iris, method = "class")
Variables actually used in tree construction:
[1] iris$Petal.Width
Root node error: 100/150 = 0.66667
n= 150
CP nsplit rel error xerror xstd
1 0.50 0 1.00 1.12 0.053267
2 0.44 1 0.50 0.58 0.059643
3 0.01 2 0.06 0.07 0.025833
(2) Model with all predictors
> iris.rpartall = rpart(iris$Species ~ ., data=iris, method="class")
> iris.rpartall
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
> printcp(iris.rpartall)
Classification tree:
rpart(formula = iris$Species ~ ., data = iris, method = "class")
Variables actually used in tree construction:
[1] Petal.Length Petal.Width
Root node error: 100/150 = 0.66667
n= 150
CP nsplit rel error xerror xstd
1 0.50 0 1.00 1.23 0.047053
2 0.44 1 0.50 0.73 0.061215
3 0.01 2 0.06 0.11 0.031927