5

There's a step in the Wikipedia article regarding the formulation of the Gini Impurity that I can't understand.

They state that:

enter image description here

I follow everything up until this point

$1-\sum_{i=1}^Jf_i^2 = \sum_{i\ne k}f_if_k$

There is a related thread that gives an intuitive explanation, but I'm wondering if anyone knows the actual mathematics behind this step.

ZachTurn
  • 185
  • 4

2 Answers2

3

Here is a snippet from my answer here. The easiest way (for me at least) to understand

$1-\sum f_i^2$ = $\sum_{i \neq k} f_if_k$

is by visually representing each of the elements in this equation. We'll assume that there are 4 labels below; however, this will scale to n values.

enter image description here

The value 1 is simply the sum of all possible probabilities. By definition this value must be 1.

The value $\sum f_i^2$ is the sum of probabilities of selecting a value and its label from the distribution of values.

Subtracting the probability that you match labels with values from 1 gives you the probability that you don't match labels and values. This is what the gini impurity provides -- the probability that you don't match labels to values.

Scott
  • 900
  • 1
  • 8
  • 12
1

I remember reading this exact thing on Wikipedia thinking it was a typo. It's not though. And the math is really simple. Note that $f_if_k$ corresponds to the probability of observing an $i$ followed by a $k$ from two independent draws from the distribution $f$. Therefore, if you sum over the probabilities of all $(i,k)$ pairs you get $1$. In other words, we have the equality, $$\sum_{i=1}^J \sum_{k=1}^J f_i f_k = 1$$ But we can rewrite the double summation as $$\sum_{k=1}^J f_i f_k = \sum_{i=1}^J f_i^2 + \sum_{i=1}^J \sum_{k=1, k \ne i}^J f_i f_k$$ Then, if you subtract $\sum_{i=1}^J f_i^2$ from the top and bottom, you end up with the equality of interest.

jjet
  • 1,187
  • 7
  • 12
  • If you show the products geometrically as tiles in a unit square; the $f_j^2$ terms can be represented by tiles on the diagonal and the others are then ... the others off the diagonal. Simple example: 0.7 and 0.3, each squared, can be plotted as two tiles of area 0.49 and 0.09 and two more tiles each of area 0.21 $=$ 0.7 $\times$ 0.3. More generally, you always have $J$ tiles that can be placed on the diagonal and then the others. – Nick Cox May 02 '17 at 17:55
  • Yeah this is the right idea. I think a visual illustration would be more helpful than my mathematical derivation. – jjet May 02 '17 at 17:59
  • 1
    @jjet, you can see my answer here for a more visual than mathematical explanation https://stats.stackexchange.com/a/339514/27974 – Scott May 30 '18 at 16:51