I'm sorry to bring back a question from ages ago, but it came up as a reference in a newer one and it looks to me like it might cause some misunderstandings.
The calculations that Nick Cox gave are absolutely correct for computing the Gini index of the features, and they tell us something about the features and their homogeneity.
However, given that your dataset has a Target variable, that you speak of using Instance as an attribute test condition, and that you mention information gain at the end, it would be easy to think that you have a Classification Tree problem, where the goal is to find the Gini decrease (the analogue of the Information Gain) obtained when splitting (testing) on the features. I will therefore give an alternative answer to the question, which can serve as a reference for Gini computation in the case of Classification Trees.
In this sense, the computations would be different.
As a first step, you would need to compute the Gini index of the starting dataset. It has 4 positives and 5 negatives and therefore it is: $$GiniStart = 1-(\frac{4}{9})^2 - (\frac{5}{9})^2 \sim 0.4938$$
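If it is useful, here is a minimal Python sketch of that computation (`gini` is just a helper name I am introducing here, not something from the original question):

```python
from typing import Sequence

def gini(counts: Sequence[int]) -> float:
    """Gini index of a node, given the class counts inside that node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([4, 5]))  # starting dataset: 4 positives, 5 negatives -> ~0.4938
```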
If we split on $a_1$ we obtain the node $T$ that has three positive instances and a negative one, and node $F$ that has one positive instance and four negative ones.
$$Gini_T = 1-(\frac{3}{4})^2 - (\frac{1}{4})^2 = 0.375$$
$$Gini_F = 1-(\frac{1}{5})^2 - (\frac{4}{5})^2 = 0.32$$
$$\Delta Gini_{a_1} = GiniStart - \frac{4}{9}Gini_T - \frac{5}{9}Gini_F \sim 0.149$$
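Continuing the sketch above (again, `gini_decrease` is my own helper name), the decrease can be computed directly from the class counts in the child nodes:

```python
def gini_decrease(parent_counts, child_counts_list):
    """Parent Gini minus the size-weighted Gini of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * gini(child) for child in child_counts_list)
    return gini(parent_counts) - weighted

# split on a1: node T has (3+, 1-), node F has (1+, 4-)
print(gini_decrease([4, 5], [[3, 1], [1, 4]]))  # ~0.149
```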
If we split on $a_2$ we obtain the node $T$ that has 2 positive instances and 3 negative ones, and node $F$ that has 2 positive instances and 2 negative ones.
$$Gini_T = 1-(\frac{2}{5})^2 - (\frac{3}{5})^2 = 0.48$$
$$Gini_F = 1-(\frac{2}{4})^2 - (\frac{2}{4})^2 = 0.5$$
$$\Delta Gini_{a_2} = GiniStart - \frac{5}{9}Gini_T - \frac{4}{9}Gini_F \sim 0.005$$
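The same helper reproduces this figure:

```python
# split on a2: node T has (2+, 3-), node F has (2+, 2-)
print(gini_decrease([4, 5], [[2, 3], [2, 2]]))  # ~0.005
```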
$a_3$ is instead a numeric variable (even though you could treat it as categorical), and as such we would need to evaluate every possible split point in its range (which I am of course not going to do in full) and choose the best one.
As an example, imagine splitting $a_3$ at $4.5$. Then we would have the "Low Values" node, with 2 positives and a negative, and the "High Values" node, with 2 positives and 4 negatives.
$$Gini_{LV} = 1-(\frac{2}{3})^2 - (\frac{1}{3})^2 \sim 0.44$$
$$Gini_{HV} = 1-(\frac{2}{6})^2 - (\frac{4}{6})^2 \sim 0.44$$
$$\Delta Gini_{a_3} = GiniStart - \frac{3}{9}Gini_{LV} -\frac{6}{9}Gini_{HV} \sim 0.049$$
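The split at $4.5$ again follows from the helpers above, and if you did want to scan all candidate thresholds, a generic sketch could look like this (`best_threshold` is my own helper; I am not reproducing your actual $a_3$ column here):

```python
def best_threshold(values, labels, positive="+"):
    """Try every midpoint between consecutive distinct sorted values of a numeric
    attribute and return the threshold with the largest Gini decrease."""
    pairs = sorted(zip(values, labels))
    n_pos = labels.count(positive)
    parent = [n_pos, len(labels) - n_pos]
    best_thr, best_dec = None, -1.0
    for (v_prev, _), (v_next, _) in zip(pairs, pairs[1:]):
        if v_prev == v_next:
            continue  # identical values cannot be separated
        thr = (v_prev + v_next) / 2
        low = [lab for v, lab in pairs if v <= thr]
        high = [lab for v, lab in pairs if v > thr]
        dec = gini_decrease(parent,
                            [[low.count(positive), len(low) - low.count(positive)],
                             [high.count(positive), len(high) - high.count(positive)]])
        if dec > best_dec:
            best_thr, best_dec = thr, dec
    return best_thr, best_dec

# the example split at 4.5: "Low Values" node has (2+, 1-), "High Values" node has (2+, 4-)
print(gini_decrease([4, 5], [[2, 1], [2, 4]]))  # ~0.049
```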
Finally, $Instance$. If we consider it as categorical, the variable is totally sparse, and it allows us to split the data into the groups $\{1,2,4,8\}$ and $\{3,5,6,7,9\}$, which would both have a Gini of $0$ since they are pure. The Gini decrease would therefore be the Gini of the parent node.
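With the helper from above, that is simply:

```python
# Instance split into the two pure groups of 4 and 5 instances
print(gini_decrease([4, 5], [[4, 0], [0, 5]]))  # ~0.4938, i.e. the parent Gini
```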
However, attributes like $Instance$ are completely sparse and have no predictive power on new data (every new entry will have a different instance number), and for this reason they are usually excluded. Alternatively, one can use a Gini ratio for the splits; that is, weight the Gini decreases by the inverse of the Gini index of the attributes themselves (this time, the ones computed by Nick!). This way, the importance of sparse variables such as $Instance$ is reduced by the fact that their Gini index is very high.
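As a sketch of that weighting (I am assuming here that "weight by the inverse" means dividing the Gini decrease by the Gini index of the attribute's own value distribution; `gini_ratio` is my own name for it):

```python
def gini_ratio(parent_counts, child_counts_list):
    """Gini decrease divided by the Gini of the attribute's own value distribution,
    taken here as the distribution of instances over the child nodes."""
    attribute_counts = [sum(child) for child in child_counts_list]
    return gini_decrease(parent_counts, child_counts_list) / gini(attribute_counts)

# Instance treated as nine singleton children (all pure) vs. the split on a1
print(gini_ratio([4, 5], [[1, 0]] * 4 + [[0, 1]] * 5))  # ~0.56
print(gini_ratio([4, 5], [[3, 1], [1, 4]]))             # ~0.30
```

The sparse attribute still scores well on this tiny dataset, but its lead over $a_1$ shrinks compared to the raw decreases ($0.49$ vs $0.15$ becomes $0.56$ vs $0.30$), which is the intended effect of the penalty.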
Please note that my answer is not in contradiction with Nick's; it simply answers a different question, since the original one might also have been interpreted this way.
PS: just for clarification, since the term information gain was used: Information Gain is computed as the decrease in Entropy after a split, and it plays nearly the same role as the decrease in Gini index, so the two usually rank splits very similarly.
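For completeness, here is the entropy-based version on the same counts (same helper style as above):

```python
import math

def entropy(counts):
    """Entropy of a node, given the class counts inside that node."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# same splits as above: a1 beats a2 by a wide margin, just as with the Gini decrease
print(information_gain([4, 5], [[3, 1], [1, 4]]))  # ~0.229
print(information_gain([4, 5], [[2, 3], [2, 2]]))  # ~0.007
```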