Choosing between best two attributes with the same information gain when building decision tree

Question

When building a decision tree, suppose that there are two attributes that have the same maximum information gain.

Will there be any difference between choosing any of the two attributes to be a tree node? Or are there any other factors that I have to consider in order to decide which attribute I should choose?

score 7 · Answer 1 · edited Apr 13 '17 at 12:44

The answer is simply that the first predictor (as found from left to right in the original data frame) is selected. See this thread for the proof. So, to sort everything out:

> Will there be any difference between choosing any of the two attributes to be a tree node?

No, selecting one or the other makes absolutely no difference.

> So for text classification, it will check in the alphabetical order.

This is only true to the extent that (the columns of) the data frame of predictors are ordered alphabetically.
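To make the ordering point concrete, here is a minimal sketch of a "first maximum wins" tie-break. It is not the code of CART or of any particular library; the function names (`entropy`, `information_gain`, `pick_split`) and the toy data are made up for illustration, and the two words are borrowed from the text-classification answer in this thread.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def pick_split(rows, labels, attributes):
    """Return the first attribute, in the order given, with the maximum gain."""
    # Python's max() returns the first maximal item, so ties resolve by position.
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: the two columns are identical, so their gains tie exactly.
rows = [
    {"Good": 1, "Awesome": 1},
    {"Good": 1, "Awesome": 1},
    {"Good": 0, "Awesome": 0},
    {"Good": 0, "Awesome": 0},
]
labels = ["pos", "pos", "neg", "neg"]

print(pick_split(rows, labels, ["Good", "Awesome"]))          # -> Good
print(pick_split(rows, labels, sorted(["Good", "Awesome"])))  # -> Awesome
```

The same data gives a different root depending only on the order in which the columns are presented, which is why "alphabetical" holds only when the columns happen to be sorted that way.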

> You could look ahead at the information gain of the remaining attributes after a split and select based on that.

This is not a bad idea and is doable by hand for a small tree, but that's definitely not what's implemented in CART. First, it would be very expensive: imagine, for a big tree, all the combinations that one would have to try. More importantly, this goes against the idea of recursive partitioning, where at each step the best predictor is simply the one that yields the best partition of the current node. This process is conditional on the previous splits (the current node was created by the previous splits); it's not the other way around. CART is inherently greedy, and it was shown that looking ahead did not give significantly better results; see this PhD thesis, section 2.5.4.

> Can you try both trees, and see which creates a better separation at your terminal nodes and/or builds a tree that fits more closely with the theoretical understanding you have of your data?

Again, with a large number of predictors, the number of trees to try would be immense. Further, machine learning is often used when the understanding of the relationship between input and output variables is very limited (otherwise, an explicit model could be specified, see this paper), so making a decision based on theoretical understanding is most of the time not possible. Finally, CART trees are usually the base models of ensembles such as Random Forest and Boosting, where large numbers of trees are grown automatically, so it is impossible to inject any kind of human understanding into the tree-building process in these cases.

I agree with you. When attributes have the same information gain, it's better to use other models and cross-check the results. – alily, Aug 17 '16 at 08:34

score 2 · Accepted Answer · edited Aug 29 '15 at 20:01

You could look ahead at the information gain of the remaining attributes after a split and select based on that. In general though, if you're using information gain as your splitting criterion, it will be the only thing to look at.
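A one-level look-ahead tie-break along these lines could be sketched as follows. This is only an illustration under my own assumptions: the scoring rule (weighted best gain in the child nodes), the helper names, the tolerance `tol`, and the hand-built toy data are all hypothetical, and no standard tree implementation is claimed to work this way.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def partition(rows, labels, attribute):
    """Group rows and labels by the value of `attribute`."""
    parts = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = parts.setdefault(row[attribute], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return parts


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after the split."""
    parts = partition(rows, labels, attribute)
    remainder = sum(len(l) / len(labels) * entropy(l) for _, l in parts.values())
    return entropy(labels) - remainder


def lookahead_score(rows, labels, attribute, attributes):
    """Weighted best gain available one level below a split on `attribute`."""
    remaining = [a for a in attributes if a != attribute]
    score = 0.0
    for sub_rows, sub_labels in partition(rows, labels, attribute).values():
        best = max(
            (information_gain(sub_rows, sub_labels, a) for a in remaining),
            default=0.0,
        )
        score += len(sub_labels) / len(labels) * best
    return score


def pick_split_with_lookahead(rows, labels, attributes, tol=1e-12):
    """Pick the max-gain attribute; break ties by the look-ahead score."""
    gains = {a: information_gain(rows, labels, a) for a in attributes}
    top = max(gains.values())
    tied = [a for a in attributes if top - gains[a] <= tol]
    return max(tied, key=lambda a: lookahead_score(rows, labels, a, attributes))


# Hypothetical toy data: A and B tie at the root, C has zero gain at the root,
# but C separates the classes perfectly once the data has been split on A.
A = [0, 0, 0, 0, 1, 1, 1, 1]
B = [0, 0, 1, 1, 0, 1, 1, 0]
C = [0, 0, 0, 1, 0, 0, 0, 1]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
rows = [{"A": a, "B": b, "C": c} for a, b, c in zip(A, B, C)]

print(pick_split_with_lookahead(rows, labels, ["B", "A", "C"]))  # -> A
```

Note that every tied attribute requires re-evaluating all remaining attributes in each child node, so the cost grows quickly if the look-ahead is made deeper.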

edited Aug 29 '15 at 20:01 by gung - Reinstate Monica
answered Feb 20 '12 at 18:40 by Lars Kotthoff

Besides "information gain" (in the case where two attributes tie for the best information gain), what other criterion should be used? – wannik Feb 22 '12 at 03:05

You _could_ use other criteria such as the [Akaike information criterion](http://en.wikipedia.org/wiki/Akaike_information_criterion). I don't think that anything _should_ be another criterion though. – Lars Kotthoff Feb 22 '12 at 17:55

If you were building a random forest, then you would be finding bootstrapped estimates of the information gain. That would give you a robust measure of information. – EngrStudent Sep 29 '15 at 19:47
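One possible reading of the bootstrap comment, sketched in plain Python: resample the rows with replacement many times and average the information gain of each tied attribute. The function `bootstrap_gain`, its parameters `n_resamples` and `seed`, and the toy data are assumptions for illustration; a real random forest does more than this (for example, it also samples candidate attributes at each node).

```python
import random
from collections import Counter
from math import log2
from statistics import mean


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def bootstrap_gain(rows, labels, attribute, n_resamples=200, seed=0):
    """Mean information gain of `attribute` over bootstrap resamples of the rows."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(rows)
    gains = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n rows with replacement
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        gains.append(information_gain(sample_rows, sample_labels, attribute))
    return mean(gains)


# Hypothetical toy data in which A and B tie on the full sample.
A = [0, 0, 0, 0, 1, 1, 1, 1]
B = [0, 0, 1, 1, 0, 1, 1, 0]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
rows = [{"A": a, "B": b} for a, b in zip(A, B)]

for attribute in ("A", "B"):
    # Same seed for both attributes, so they are scored on identical resamples.
    print(attribute, information_gain(rows, labels, attribute),
          bootstrap_gain(rows, labels, attribute))
```

On a sample this small the bootstrap averages are noisy, so any difference between the two attributes should be read with caution.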

score 1 · Answer 3 · edited Sep 29 '15 at 19:37

I work on text classification using decision tree classifiers (ID3, random forests, etc.), so I can give you an example from that setting. In text classification the attributes are the individual words. You can reduce the features using an information gain threshold, and once you have the reduced feature set, tree building follows the procedure described below.

When building the decision tree, it starts with the attribute that has the highest information gain. Now suppose more than one word/attribute has that same information gain value. So for text classification, it will check in the alphabetical order.

For example, if two words tie for the highest information gain at the root node ("Good" with IG=0.5 and "Awesome" with IG=0.5), "Awesome" will be selected as the root node.

I hope this helps clear up the doubt.

It would be great if you could download the machine learning package "Weka" and try out the decision tree classifier on your own dataset. The nice thing is that after the classification process it lets you view the decision tree that was created. Weka's ID3, Random Tree and Random Forest use information gain for splitting nodes.

score 1 · Answer 4 · answered Feb 20 '12 at 19:27

The decision tree method is a step-wise approach to model building, so variables entered earlier have a direct impact on which variables are entered later. Can you try both trees, and see which creates a better separation at your terminal nodes and/or builds a tree that fits more closely with the theoretical understanding you have of your data?

You could also see what the validation testing shows after building both trees, and go with the tree that fits all the data best.
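As a rough sketch of that comparison, the snippet below grows one small tree per tied root attribute and scores both on held-out rows. Everything here is hypothetical: the tiny ID3-style builder `build_tree` with its `root` and `max_depth` parameters, and the toy training and validation data, exist only to illustrate the idea and are not any library's API.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after the split."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes, root=None, max_depth=2):
    """Tiny ID3-style tree; `root` optionally forces the first split attribute."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes or max_depth == 0:
        return majority  # leaf: predict the majority class
    if root is not None:
        attr = root
    else:
        attr = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted(set(row[attr] for row in rows)):
        sub_rows = [r for r in rows if r[attr] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[attr] == value]
        branches[value] = build_tree(sub_rows, sub_labels, remaining,
                                     max_depth=max_depth - 1)
    return {"attr": attr, "branches": branches, "default": majority}


def predict(tree, row):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attr"]], tree["default"])
    return tree


def accuracy(tree, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(tree, r) == l for r, l in zip(rows, labels)) / len(labels)


# Hypothetical training data: A and B tie for the best gain at the root.
A = [0, 0, 0, 0, 1, 1, 1, 1]
B = [0, 0, 1, 1, 0, 1, 1, 0]
C = [0, 0, 0, 1, 0, 0, 0, 1]
train_labels = [1, 1, 1, 0, 0, 0, 0, 1]
train_rows = [{"A": a, "B": b, "C": c} for a, b, c in zip(A, B, C)]

# Hypothetical held-out rows for validation.
val_rows = [
    {"A": 0, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 0, "B": 0, "C": 1},
]
val_labels = [1, 1, 0, 0]

for root in ("A", "B"):
    tree = build_tree(train_rows, train_labels, ["A", "B", "C"], root=root)
    print(root, accuracy(tree, train_rows, train_labels),
          accuracy(tree, val_rows, val_labels))
```

With only two tied candidates this is cheap; the number of trees to compare multiplies if ties also occur at deeper nodes.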
