Gini index in classification tree

Question

In "An Introduction to Statistical Learning" by Gareth James et al., the Gini index is discussed in the paragraph I clipped in the following image:

My question is about the statement "For this reason the Gini index is referred to as a measure of node purity -- a small value indicates that a node contains predominantly observations from a single class." I don't understand the logic behind this. When $\hat p_{mk}$ is close to 1 (and thus the Gini index is small), this means, by definition, that most of the training observations in the $m$th region are from the $k$th class. In income statistics, does that mean the training observations are from the high-income or the low-income class? If so, this seems contradictory: a high Gini index in income statistics means that most people are in either the high-income class or the low-income class (the wealth gap is large). What am I misunderstanding here?

It's even more obscure to me when $\hat p_{mk}$ is close to 0.

Of possible interest: https://stats.stackexchange.com/a/6610/930 (regarding node impurity and deviance in CART). — chl, Nov 01 '20 at 19:12

score 1 · Answer 1 · answered Mar 11 '19 at 08:23

I would shy away from any specific label interpretation and just look at it from a general classification standpoint. You have $K$ classes (I omit the region index because it's not important) and an empirical multinomial distribution $\hat{p}_1,\ldots,\hat{p}_K$. The Gini index, alongside other impurity measures such as entropy (which underlies the mutual-information criterion), is just one of the more sensible measures you can use (it is concave, which is very important).

If you want to interpret it in some way, it is the expected misclassification rate when you sample a random example from the region and assign it a label at random using the empirical distribution $\hat{p}_k$, that is, $\sum_{k} \hat{p}_k (1 - \hat{p}_k)$.
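A quick way to convince yourself of this interpretation is to simulate it. The following is only a minimal sketch: the function names and the example proportions `[0.7, 0.2, 0.1]` are made up for illustration, and the simulation draws both the "true" label and the guessed label from the same empirical distribution.

```python
import random

def gini_impurity(probs):
    """Gini index of a node: sum over classes of p_k * (1 - p_k)."""
    return sum(p * (1.0 - p) for p in probs)

def simulated_error_rate(probs, n_trials=100_000, seed=0):
    """Estimate how often a randomly guessed label differs from the true one.

    Both the true label and the guess are drawn independently from the
    class proportions `probs` (an illustrative assumption matching the
    interpretation above).
    """
    rng = random.Random(seed)
    classes = range(len(probs))
    true_labels = rng.choices(classes, weights=probs, k=n_trials)
    guesses = rng.choices(classes, weights=probs, k=n_trials)
    errors = sum(t != g for t, g in zip(true_labels, guesses))
    return errors / n_trials

# Hypothetical node with three classes.
probs = [0.7, 0.2, 0.1]
print("Gini index:          ", gini_impurity(probs))
print("Simulated error rate:", simulated_error_rate(probs))
```

The two printed numbers should agree up to simulation noise, which shrinks as `n_trials` grows.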

I have always found Gilles Louppe's PhD thesis, Understanding Random Forests: From theory to practice (https://arxiv.org/pdf/1407.7502.pdf), to be one of the best references for understanding how tree-based methods actually work (not only theoretically, but also how to implement them efficiently).

score 0 · Answer 2 · answered Mar 11 '19 at 07:30

If, in a given node of the tree, a vast majority of individuals have "high income", the Gini index is small. The same holds if a vast majority of individuals have "low income". The highest value of the Gini index is reached when half of the observations are "high income" and the other half are "low income".
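To put numbers on this, here is a small sketch for the two-class case. The function name and the grid of proportions are just illustrative choices; the formula is the usual $\sum_k \hat p_k (1 - \hat p_k)$ specialised to two classes.

```python
def two_class_gini(p_high):
    """Gini index of a node with class proportions (p_high, 1 - p_high)."""
    p_low = 1.0 - p_high
    return p_high * (1.0 - p_high) + p_low * (1.0 - p_low)

# Proportion of "high income" observations in the node (illustrative values).
for p_high in (0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p_high = {p_high:.2f}  ->  Gini = {two_class_gini(p_high):.4f}")
```

The value is symmetric in the two classes: a proportion close to 0 gives the same small Gini index as a proportion close to 1, because in both cases one class dominates the node. The maximum, 0.5, occurs at an even split.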

So I think you got the idea right. You may feel uncomfortable about this result because income is actually a numerical variable and we are only splitting the population into "high" and "low" at an arbitrary threshold. However, think of blue vs brown eyes and you will see that it makes the most sense to see a high Gini index when the node is "least pure" or "most diverse" (meaning that half of the individuals in the node have brown eyes and the other half have blue ones).

In summary, you got it right, except that there is no contradiction. Let's look at it another way to make it more intuitive.

Imagine there are only two possible yearly incomes, "low" corresponding to €10,000 and "high" corresponding to €90,000 (remember that, when using the labels "high" and "low", your model cannot distinguish any further than this). A population with half €90K earners and half €10K earners is more "unequal" than one with millions of people making €10K and one lucky guy who makes €90K, isn't it?
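If you want to check this with the income-statistics Gini coefficient itself (a different quantity from the tree's Gini index), here is a rough sketch. It assumes the standard mean-absolute-difference definition of the coefficient, and it uses 1,000 people instead of millions only to keep the brute-force double loop fast; the population sizes are made up for illustration.

```python
def gini_coefficient(incomes):
    """Income Gini coefficient: mean absolute difference / (2 * mean income).

    Brute-force O(n^2) version, fine for small illustrative populations.
    """
    n = len(incomes)
    mean_income = sum(incomes) / n
    total_abs_diff = sum(abs(x - y) for x in incomes for y in incomes)
    return total_abs_diff / (2 * n * n * mean_income)

# Half of the population earns 10K, the other half 90K.
half_and_half = [10_000] * 500 + [90_000] * 500

# Almost everyone earns 10K, one person earns 90K
# (a small stand-in for "millions of people and one lucky guy").
one_lucky_guy = [10_000] * 999 + [90_000]

print("half and half:", gini_coefficient(half_and_half))
print("one lucky guy:", gini_coefficient(one_lucky_guy))
```

The evenly split population comes out far more unequal than the nearly homogeneous one, which is the same ordering the tree's Gini index gives for the "high"/"low" labels.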
