I've read many related articles and posts, and the more I read, the more confused I became about the 'Gini index' and 'Gini impurity'. I understand the concept, but it seems that different people use these terms differently. The ISLR book* (page 326) defines the Gini index as $\sum p_i(1 - p_i)$, or equivalently $1 - \sum p_i^2$.

However, this article (and many others) computes Gini with the formula $p^2+q^2$ for a binary classifier. [The same question was asked in its comments by Shanu, but it has not been answered.]

So their Gini impurity, $1 - \text{Gini index}$, is exactly the same as the Gini index as computed in the ISLR book.

Please let me know what I am missing. I realize that reading concepts after a long break is painful.

*Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.

Dr Nisha Arora
  • Apart from looking at the formulas, the words _purity_ and _impurity_ are indicative (so long as they are used carefully). $\sum p_i^2$ is maximal (purity is highest) when there is just one category present, because the sum is then $1^2$ plus any number of $0^2$ terms, which is just $1$. $1 - \sum p_i^2$ is minimal in the same case (impurity is lowest). – Nick Cox Jun 05 '19 at 06:09
  • $p^2 + q^2$ is clearly not a general recipe, but applies only when there are two categories in play. – Nick Cox Jun 05 '19 at 06:12
  • The Gini index formula is the $G$ you defined above. The expression $p^2 + q^2$ measures purity in a sense; it is specific to two classes, and the $1$ from $G$ was dropped because it is constant when you compare two nodes in a decision tree. Splitting criteria in decision trees usually use impurity measures, e.g. the Gini index or entropy. An example is here: https://stats.stackexchange.com/questions/44382/mathematics-behind-classification-and-regression-trees/44404#44404 – Simone Jun 05 '19 at 06:29
  • ISLR won't be recognisable by all readers, so please give references in good academic style (authors, date, book title, publishers, place). – Nick Cox Jun 05 '19 at 07:44
  • Thanks Nick, done that. – Dr Nisha Arora Jun 05 '19 at 16:09
  • "the words PURITY and IMPURITY are indicative". I understand this concept, and obviously the decision rule is the same both ways [even for other data sets]. However, in the book I referred to, the 'Gini index' is a measure of impurity, whereas in the article I mentioned above, '1 - Gini index' is the measure of impurity [called Gini impurity]. – Dr Nisha Arora Jun 05 '19 at 16:53
  • Yes, I should have mentioned that $p^2 + q^2$ applies only to binary classification. – Dr Nisha Arora Jun 05 '19 at 16:55

1 Answer

Usually, the terms Gini index and Gini impurity are used as synonyms. Indeed, when defined as $1-\sum p_i^2$, it measures impurity, in the sense that it increases as the node becomes more mixed.
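As a quick illustration of "increases with impurity", here is a minimal Python sketch. The function name `gini_impurity` and the example class proportions are my own assumptions for illustration; they do not come from ISLR or from any library.

```python
from typing import Sequence


def gini_impurity(proportions: Sequence[float]) -> float:
    """Return 1 - sum(p_i^2) for class proportions that sum to 1.

    Hypothetical helper written for this illustration only.
    """
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return 1.0 - sum(p * p for p in proportions)


pure = gini_impurity([1.0, 0.0])             # only one class present
mixed = gini_impurity([0.5, 0.5])            # two classes, evenly mixed
three = gini_impurity([1 / 3, 1 / 3, 1 / 3]) # three classes, evenly mixed

print(pure, mixed, three)
```

The pure node gives 0, the evenly mixed binary node gives 0.5, and three evenly mixed classes give about 0.667, so the value grows as the node becomes more mixed.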

To me it looks like the link you gave uses an alternative, rather confusing definition, in which the Gini index is a measure of purity and Gini impurity is defined as 1 - index. This is something I had never seen in the literature, and after a quick tour of links and definitions I could not find it anywhere else.
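To make the relationship between the two conventions concrete, here is a short worked check for the binary case. The symbols $p$ and $q = 1 - p$ are the two class proportions, and the values $0.7$ and $0.3$ are made up for illustration.

$$p^2 + q^2 = 0.7^2 + 0.3^2 = 0.58$$

$$1 - (p^2 + q^2) = 0.42 = p(1-p) + q(1-q) = 2pq$$

The first line is what the linked article calls the Gini index (a purity measure); the second line is what it calls Gini impurity, and it coincides with the Gini index as defined in ISLR.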

Therefore, I would rather use the definition you can find in the ISLR book (James, Witten, Hastie and Tibshirani), as it is the most common. Indeed, we can trace that definition back to Classification And Regression Trees (Breiman et al., 1984):

4.3.3 The Gini Criterion
[...]
In later work the Gini diversity index was adopted. This has the form: $$i(t) = \sum_{i\neq j}p(i|t)p(j|t)$$ and can also be written as $$i(t) = [...] =1-\sum_jp(j|t)^2$$

The original name is therefore Gini (diversity) index, but since it is a measure of impurity you may also call it Gini impurity.
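For completeness, here is a sketch of the algebra connecting the two forms quoted above. This is my own filling-in, not the step elided from the book, and it only assumes that the class proportions in node $t$ sum to one, $\sum_j p(j|t) = 1$:

$$\sum_{i\neq j}p(i|t)p(j|t) = \sum_i p(i|t)\sum_{j\neq i}p(j|t) = \sum_i p(i|t)\bigl(1-p(i|t)\bigr) = 1-\sum_i p(i|t)^2$$

The third expression is the $\sum p_i(1 - p_i)$ form used in ISLR, so the book's Gini index and Breiman's Gini diversity index are the same quantity.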

Davide ND