Questions tagged [gini]

The Gini coefficient is used to measure income inequality and the discriminatory power of a classifier. If everybody has the same income, the Gini coefficient is 0. If one person has all the income, the Gini coefficient is 1. All other values are somewhere in between.

In income distribution, the Gini index is best explained using the Lorenz curve, which plots the cumulative proportion of the population, ordered by income, on the x axis against the proportion of total income these people receive on the y axis. E.g., a point (0.3, 0.05) on the Lorenz curve means that the poorest 30% of people receive 5% of the total income in the economy. See below for a Lorenz curve from Wikipedia.

[Figure: Lorenz curve, cumulative share of people vs. cumulative share of income]

The Gini coefficient, or Gini index, is twice the grey area between the line of equality and the actual Lorenz curve. If everybody has the same income, the Gini coefficient is 0. If one person has all the income, the Gini coefficient is 1 (in the limit of a large population). All other values are somewhere in between.
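The area definition can be checked numerically. The following is a minimal sketch, assuming a plain list of non-negative incomes with a positive total; the function names `lorenz_points` and `gini_from_lorenz` are made up for illustration. It builds the empirical Lorenz curve and applies the trapezoid rule.

```python
def lorenz_points(incomes):
    """Return (cumulative population share, cumulative income share) points."""
    ys = sorted(incomes)          # order people from poorest to richest
    n = len(ys)
    total = sum(ys)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, y in enumerate(ys, start=1):
        running += y
        points.append((i / n, running / total))
    return points


def gini_from_lorenz(incomes):
    """Twice the area between the line of equality and the Lorenz curve."""
    pts = lorenz_points(incomes)
    area_under = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area_under += (x1 - x0) * (y0 + y1) / 2   # trapezoid rule
    # The area under the line of equality is 1/2.
    return 2 * (0.5 - area_under)


print(gini_from_lorenz([1, 1, 1, 1]))    # perfect equality
print(gini_from_lorenz([0, 0, 0, 10]))   # one person holds everything
```

With a finite population of $n$ people, the "one person has everything" case gives $(n-1)/n$ under this area-based computation, which approaches 1 as $n$ grows.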

There are various expressions for the Gini coefficient based on i.i.d. data, which are equivalent up to a finite-sample factor of $n/(n-1)$:

\begin{align} G &= \frac1{2\bar y\, n(n-1)} \sum_{i\neq j} |y_i - y_j| \\[10pt] &\approx \frac{1}{n}\left ( n+1 - 2 \left ( \frac{\sum\limits_{i=1}^n \; (n+1-i)y_i}{\sum\limits_{i=1}^n y_i} \right ) \right ) \\[10pt] &= \frac 2{\bar y} {\rm Cov}(F_Y(Y), Y) \end{align}

where $\bar y$ is the mean income and, in the second expression, the incomes are sorted so that $y_1 \le \dots \le y_n$. The first expression shows an interpretation of the Gini coefficient as half the average absolute difference in incomes across the population, relative to the mean income (if you were taken out and thrown back into this population at a random position, by how much would your income change?), and it provides the kernel of a second-order $U$-statistic. Because it divides by $n(n-1)$ rather than $n^2$, it exceeds the second expression by a factor of $n/(n-1)$, which is negligible for large $n$. The last expression relates the Gini coefficient to moments of the distribution, which allows generalization to non-i.i.d. data (such as complex survey data).
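The three forms can be compared on a small sample. This is a sketch under stated assumptions: the function names are hypothetical, the pairwise form includes the factor 1/2 that makes it the Gini coefficient (half the relative mean absolute difference), and the covariance form uses the empirical CDF $F(y_{(i)}) = i/n$ with a $1/n$ covariance.

```python
def gini_pairwise(y):
    """U-statistic form: sum over ordered pairs i != j, divided by 2*mean*n*(n-1)."""
    n = len(y)
    mean = sum(y) / n
    total = sum(abs(a - b)
                for i, a in enumerate(y)
                for j, b in enumerate(y) if i != j)
    return total / (2 * mean * n * (n - 1))


def gini_rank(y):
    """Rank-weighted form; requires incomes sorted in ascending order."""
    ys = sorted(y)
    n = len(ys)
    weighted = sum((n + 1 - i) * v for i, v in enumerate(ys, start=1))
    return (n + 1 - 2 * weighted / sum(ys)) / n


def gini_cov(y):
    """Covariance form 2*Cov(F(Y), Y)/mean with the empirical CDF F = rank/n."""
    ys = sorted(y)
    n = len(ys)
    mean_y = sum(ys) / n
    f = [i / n for i in range(1, n + 1)]
    mean_f = sum(f) / n
    cov = sum((fi - mean_f) * (yi - mean_y) for fi, yi in zip(f, ys)) / n
    return 2 * cov / mean_y


sample = [1, 2, 3, 4]
n = len(sample)
print(gini_rank(sample), gini_cov(sample))       # these two agree
print(gini_pairwise(sample) * (n - 1) / n)       # matches after rescaling
```

The rank and covariance forms agree exactly under these conventions, while the pairwise form is larger by the factor $n/(n-1)$.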

In classification applications, the Gini coefficient is a linear rescaling of the area under the ROC curve, with $$ AUC = (G+1)/2 $$ or equivalently $G = 2\,AUC - 1$.
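To illustrate the relation, here is a small sketch assuming binary labels coded as 0/1 and real-valued scores; `auc_pairwise` and `gini_from_auc` are illustrative names, not a library API. AUC is computed as the share of (positive, negative) pairs in which the positive case gets the higher score, with ties counted as one half.

```python
def auc_pairwise(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc):
    """Invert AUC = (G + 1) / 2."""
    return 2 * auc - 1


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
auc = auc_pairwise(scores, labels)
print(auc, gini_from_auc(auc))   # 0.75 0.5
```

A classifier that ranks at random has AUC = 0.5 and hence G = 0, while a perfect ranking has AUC = 1 and G = 1.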

See also:

Wikipedia article

A. Sen. On economic inequality.

110 questions
24
votes
2 answers

What is the relationship between the GINI score and the log-likelihood ratio

I am studying classification and regression trees, and one of the measures for the split location is the GINI score. Now I am used to determining best split location when the log of the likelihood ratio of the same data between two distributions…
18
votes
1 answer

What is the difference between GINI and AUC curve interpretation?

We used to create the GINI curve using lift computed from the percentages of good and bad for scorecard modelling. But from what I have studied, the ROC curve is created using the confusion matrix with Specificity (1 - True Negative) as the x axis and sensitivity(…
user78837
  • 181
  • 1
  • 1
  • 4
17
votes
1 answer

logloss vs gini/auc

I've trained two models (binary classifiers using h2o AutoML) and I want to select one to use. I have the following results: model_id auc logloss logloss_train logloss_valid gini_train gini_valid DL_grid_1 0.542694 0.287469…
Dan
  • 1,288
  • 2
  • 12
  • 30
17
votes
2 answers

Why use Normalized Gini Score instead of AUC as evaluation?

Kaggle's competition Porto Seguro's Safe Driver Prediction uses Normalized Gini Score as evaluation metric and this got me curious about the reasons for this choice. What are the advantages of using normalized gini score instead of the most usual…
xboard
  • 1,008
  • 11
  • 17
16
votes
2 answers

A simple & clear explanation of the Gini impurity?

In a context of decision tree splitting, it is not obvious to see why the Gini impurity $$ i(t)=1-\sum\limits_{j=1}^k p^2(j|t) $$ is a measure of node t impurity. Is there an easy explanation of this?
Picaud Vincent
  • 481
  • 1
  • 3
  • 12
15
votes
1 answer

Does Breiman's random forest use information gain or Gini index?

I would like to know if Breiman's random forest (random forest in R randomForest package) uses as a splitting criterion (criterion for attribute selection) information gain or Gini index? I tried to find it out on…
somebody
  • 151
  • 1
  • 4
12
votes
3 answers

Difference in summary statistics: Gini coefficient and standard deviation

There are several summary statistics. When you want to describe the spread of a distribution you can use for example the standard deviation or Gini coefficient. I know that the standard deviation is based on central tendency, i.e. deviation from the…
Olivier_s_j
  • 1,055
  • 2
  • 11
  • 25
11
votes
4 answers

Trying to compute Gini index on StackOverflow reputation distribution?

I'm trying to compute the Gini index on the SO reputation distribution using SO Data Explorer. The equation I'm trying to implement is this:…
yossale
  • 213
  • 4
  • 9
11
votes
1 answer

Gini coefficient and error bounds

I have a time series of data with N=14 counts at each time point, and I want to calculate the Gini coefficient and a standard error for this estimate at each time point. Since I have only N=14 counts at each time point I proceeded by calculating the…
Sean
  • 1,569
  • 1
  • 18
  • 25
10
votes
5 answers

How to measure dispersion in word frequency data?

How can I quantify the amount of dispersion in a vector of word counts? I'm looking for a statistic that will be high for document A, because it contains many different words that occur infrequently, and low for document B, because it contains one…
dB'
  • 225
  • 3
  • 15
8
votes
1 answer

Rank correlation statistics comparison

I am trying to understand the relative behavior of the following rank correlation statistics: Spearman coefficient Kendall Tau / Concordance percentage Normalized Gini coefficient (area under curve of percentage captured versus percentage…
cohoz
  • 618
  • 5
  • 16
8
votes
4 answers

Basic Gini impurity derivation

From wikipedia: https://en.wikipedia.org/wiki/Decision_tree_learning I am unable to get my head around two of the steps: The first equation: $f_i(1 - f_i)$. This does not immediately become apparent as the "probability of being chosen times…
7
votes
2 answers

Computing the Gini index

How do I compute the Gini index using Instance attribute as attribute test condition? I calculated the Gini, but I have no clue how to do it for this Instance attribute. $$\text{Gini for } a_1 = 0.345 $$ $$\text{Gini for } a_2 = 0.493…
Mike John
  • 624
  • 3
  • 6
  • 19
7
votes
2 answers

How is the Weighted Gini Criterion defined?

I am interested in trying out and/or implementing the Weighted Random Forest (WRF) algorithm described in Chen, Liaw, Breiman. How is the Weighted Gini impurity actually defined? What implementations of the algorithm exist? My best guess would be…
Kevin Teh
  • 173
  • 1
  • 1
  • 5
7
votes
1 answer

Deviance vs Gini coefficient in GLM

What are the pros and cons of using Deviance as opposed to Gini coefficient when measuring the quality of regression / classification models? From experience, I see that people like Gini more than Deviance. I don't know the reason, but perhaps the…