Derivation of the Simpson index

Question

I'm interested in the derivation of the Simpson diversity index. I found the usual interpretation of this index, but I can't see the "mathematical background" of the formula. I read Simpson's original paper, but I'm looking for an easier explanation (my skills in probability theory are at an undergraduate level). I would appreciate a little help here.

Thanks in advance

Meiner

score 1 · Answer 1 · answered May 21 '14 at 10:20

The "normal" interpretation of the index is that it is an index to measure biodiversity.

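Mathematically, the formula is $D=\sum(n/N)^2$ or $D=\sum\frac{n(n-1)}{N(N-1)}$, where the sum runs over all species.

$n$ is the number of individuals of a given species and $N$ is the total number of individuals of all species. What is being calculated is the probability that two randomly selected individuals from a sample belong to the same species. The first form corresponds to drawing the two individuals with replacement, the second to drawing them without replacement.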

The bigger the value of $D$, the more likely the two individuals are to be from the same species, and hence the lower the diversity. Because this is counter-intuitive, $1-D$ is often reported instead.
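A minimal Python sketch of the two formulas above may make the difference concrete. The function name `simpson_index`, the `with_replacement` flag, and the counts `[3, 2, 1]` are made up for illustration and are not taken from any particular library.

```python
from fractions import Fraction


def simpson_index(counts, with_replacement=False):
    """Probability that two randomly drawn individuals share a species.

    counts: number of individuals per species (hypothetical example data).
    """
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two individuals")
    if with_replacement:
        # D = sum (n/N)^2
        return sum(Fraction(n, total) ** 2 for n in counts)
    # D = sum n(n-1) / (N(N-1))
    return Fraction(sum(n * (n - 1) for n in counts), total * (total - 1))


counts = [3, 2, 1]  # illustrative species counts, N = 6
d = simpson_index(counts)
print("D     =", d)      # without replacement
print("1 - D =", 1 - d)  # larger value means more diverse
print("D (with replacement) =", simpson_index(counts, with_replacement=True))
```

Exact fractions are used here only so that the results can be compared with a hand calculation; floats would work equally well.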

rocinante

Or $1/D$, which is interpretable as the equivalent number of equally common species. Note that Simpson borrowed from Turing and that similar if not identical measures were suggested earlier and later by Gini, Hirschman, Herfindahl, Greenberg, etc. in statistics, economics, linguistics, etc. Even in ecology the discreteness of individual organisms that drives using $n(n-1)$ rather than $n^2$, etc., is not universal, as abundance may be measured as a fraction or percent rather than a count. – Nick Cox May 21 '14 at 16:00
@Nick Cox I didn't know that. For econ specifically, I think the Gini inequality index is what most(?) people think of when you mention the Gini coefficient. That Gini coefficient is not really all that similar to the Simpson index, though. – rocinante May 21 '14 at 16:14
We're both right. The name Gini has been attached to several quite different measures. It was his fault; he did invent several. Gini coefficients as particularly associated with (e.g.) income inequality are indeed different. But it's true that he suggested $1 - D$ in your notation long before Simpson. – Nick Cox May 21 '14 at 16:18

Adam Bailey · Accepted Answer · 2014-05-21T14:42:39.043

There are several versions of the Simpson diversity index, as explained in this website. I will focus here on this version, which I have stated slightly more precisely to clarify what the sum is over:

$$D = \sum_{i=1}^k\frac{n_i(n_i-1)}{N(N-1)}$$

Here $N$ is the total number of individuals within a habitat, and $n_i$ is the number of individuals of the $i$th of $k$ species. Note that, since $N$ is not indexed by $i$, it makes no difference whether the denominator is within the scope of the sum. We can equally write:

$$D =\frac{\sum_{i=1}^kn_i(n_i-1)}{N(N-1)}$$
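
As a quick check with made-up numbers (not from any real data set), suppose a habitat has $k=3$ species with counts $n_1=3$, $n_2=2$, $n_3=1$, so that $N=6$. Then:

$$D = \frac{3\cdot 2 + 2\cdot 1 + 1\cdot 0}{6\cdot 5} = \frac{8}{30} = \frac{4}{15} \approx 0.267$$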

The derivation of these formulae is a straightforward application of probability. If two individuals are chosen at random from the habitat, the total number of possible outcomes $O_{tot}$ is:

$$O_{tot} = N(N-1)/2$$

The division by $2$ is to avoid duplication where the same two individuals are chosen in reverse order. The outcomes of interest $O_{int}$ are those in which the chosen two individuals belong to the same species. For any one species $i$ the number of pairs $P_i$ of individuals belonging to that species is:

$$P_i = n_i(n_i-1)/2$$

To find $O_{int}$ we must sum over all species:

$$O_{int} = \sum_{i=1}^kP_i = \sum_{i=1}^kn_i(n_i-1)/2$$

Dividing $O_{int}$ by $O_{tot}$ and cancelling the divisions by $2$ yields the second of the above formulae for $D$.
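
The counting argument above can be checked by brute force. The following is a small Python sketch, assuming illustrative counts `[3, 2, 1]`; the function names are hypothetical. It lists every unordered pair of individuals, counts the same-species pairs, and compares the ratio with the closed-form expression for $D$.

```python
from fractions import Fraction
from itertools import combinations


def simpson_by_enumeration(counts):
    """Return (O_int, O_tot, D) by listing every unordered pair."""
    # One label per individual: the index of its species.
    individuals = [i for i, n in enumerate(counts) for _ in range(n)]
    pairs = list(combinations(individuals, 2))  # N(N-1)/2 unordered pairs
    same = sum(1 for a, b in pairs if a == b)   # pairs within one species
    return same, len(pairs), Fraction(same, len(pairs))


def simpson_by_formula(counts):
    """D = sum n_i(n_i - 1) / (N(N - 1))."""
    total = sum(counts)
    return Fraction(sum(n * (n - 1) for n in counts), total * (total - 1))


counts = [3, 2, 1]  # illustrative species counts, N = 6
o_int, o_tot, d_enum = simpson_by_enumeration(counts)
print("O_int =", o_int, " O_tot =", o_tot, " D =", d_enum)
print("formula gives D =", simpson_by_formula(counts))
```

Enumeration is quadratic in $N$, so it is only useful as a sanity check on small examples; the formula is what one would use in practice.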
