I'm not sure of the best way to explain this, so let me give an example that motivates my question. I've tried reading the RPART manual, documentation, and source code, but I haven't been able to resolve this.
Let's build two classification trees from the iris data: one using only 2 predictors (Sepal.Width & Petal.Width), and the other using all predictors. Some things to examine in the code and results below:
- The primary splitters are different in each tree
- The cp tables have different xerror and xstd values for the first split
- The classification summaries are the same (i.e., the same class probabilities and the same observations in each node)
- Manual calculations (not shown, but can be made available) show that the primary splitters in each tree have the same Gini information gain of 0.333
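That 0.333 figure can be reproduced directly from the node counts shown in the rpart output below; here is a minimal sketch in Python (the `gini` helper is mine for illustration, not rpart's code):

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# Class counts (setosa, versicolor, virginica) read off the rpart output below.
root = [50, 50, 50]                    # root node: 50 of each species
left, right = [50, 0, 0], [0, 50, 50]  # either candidate split yields these children

n = sum(root)
gain = gini(root) - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)
print(round(gain, 3))  # 0.333
```

Both `Petal.Width < 0.8` and `Petal.Length < 2.45` isolate the 50 setosa observations, so they produce identical child counts and hence identical gain.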
When growing the tree, the algorithm is greedy (it doesn't look ahead). If the Gini information gains are the same for these primary splitters, how does RPART pick the "winner" to continue building the tree? My understanding is that the xerror values come from cross-validation, which occurs after the maximal tree is built, so it doesn't make sense to compute xerror when first determining the splitters.
Edit: Ignoring RPART, how should this situation be handled theoretically? Does it even matter which splitter is picked when they both have maximal information gain?
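To make the tie-breaking question concrete, here is a toy greedy search loop of my own (not rpart's actual C code; whether rpart uses a strict comparison and scan order to break ties is an assumption here). With a strict `>`, the first candidate scanned wins a tie:

```python
def gini(counts):
    """Gini impurity of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def best_split(root, candidates):
    """Greedy pick: keep the first candidate with strictly better gain,
    so exact ties are broken by scan order."""
    n = sum(root)
    best_name, best_gain = None, -1.0
    for name, (left, right) in candidates:
        gain = (gini(root)
                - (sum(left) / n) * gini(left)
                - (sum(right) / n) * gini(right))
        if gain > best_gain:  # strict '>' keeps the earlier candidate on a tie
            best_name, best_gain = name, gain
    return best_name, best_gain

# Both root candidates isolate the 50 setosa (counts from the output below).
root = [50, 50, 50]
candidates = [("Petal.Length < 2.45", ([50, 0, 0], [0, 50, 50])),
              ("Petal.Width < 0.8",   ([50, 0, 0], [0, 50, 50]))]
name, gain = best_split(root, candidates)
print(name, round(gain, 3))  # Petal.Length < 2.45 0.333
```

Note that either choice produces identical children here, so the fitted tree's resubstitution error is unaffected; only the label of the split differs.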
Code and Results
(1) Model with Sepal.Width & Petal.Width
> iris.rpart2d = rpart(iris$Species ~ iris$Sepal.Width + iris$Petal.Width, data=iris, method="class")
> iris.rpart2d
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
2) iris$Petal.Width< 0.8 50 0 setosa (1.00000000 0.00000000 0.00000000) *
3) iris$Petal.Width>=0.8 100 50 versicolor (0.00000000 0.50000000 0.50000000)
6) iris$Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
7) iris$Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
> printcp(iris.rpart2d)
Classification tree:
rpart(formula = iris$Species ~ iris$Sepal.Width + iris$Petal.Width,
data = iris, method = "class")
Variables actually used in tree construction:
[1] iris$Petal.Width
Root node error: 100/150 = 0.66667
n= 150
CP nsplit rel error xerror xstd
1 0.50 0 1.00 1.12 0.053267
2 0.44 1 0.50 0.58 0.059643
3 0.01 2 0.06 0.07 0.025833
(2) Model with all predictors
> iris.rpartall = rpart(iris$Species ~ ., data=iris, method="class")
> iris.rpartall
n= 150
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)
2) Petal.Length< 2.45 50 0 setosa (1.00000000 0.00000000 0.00000000) *
3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
6) Petal.Width< 1.75 54 5 versicolor (0.00000000 0.90740741 0.09259259) *
7) Petal.Width>=1.75 46 1 virginica (0.00000000 0.02173913 0.97826087) *
> printcp(iris.rpartall)
Classification tree:
rpart(formula = iris$Species ~ ., data = iris, method = "class")
Variables actually used in tree construction:
[1] Petal.Length Petal.Width
Root node error: 100/150 = 0.66667
n= 150
CP nsplit rel error xerror xstd
1 0.50 0 1.00 1.23 0.047053
2 0.44 1 0.50 0.73 0.061215
3 0.01 2 0.06 0.11 0.031927