11

I'm trying to compute the Gini index on the SO reputation distribution using SO Data Explorer. The equation I'm trying to implement is this: $$ G(S)=\frac{1}{n-1}\left(n+1-2\left(\frac{\sum^n_{i=1}(n+1-i)y_i}{\sum^n_{i=1}y_i}\right)\right) $$ Where: $n$ = number of users on the site; $i$ = user serial id (1 - 1,225,000); $y_i$ = reputation of user $i$.

This is how I implemented it (copied from here):

DECLARE @numUsers int
SELECT @numUsers = COUNT(*) FROM Users
DECLARE @totalRep float
SELECT @totalRep = SUM(Users.Reputation) FROM Users
DECLARE @giniNominator float
SELECT @giniNominator = SUM( (@numUsers + 1 - CAST(Users.Id as Float)) * 
                              CAST(Users.Reputation as Float)) FROM Users
DECLARE @giniCalc float
SELECT @giniCalc = (@numUsers + 1 - 2*(@giniNominator / @totalRep)) / @numUsers
SELECT @giniCalc

My result is (currently) -0.53, but it makes no sense: I'm not sure even how it could have become negative, and even in abs value, I would have expected the inequality to be much closer to 1, given how reputation grows the more you have it.

Am I unknowingly ignoring some assumption about the distribution of the reputation/users?

What do I do wrong?

yossale
  • 213
  • 4
  • 9
  • You're right, but I'm not sure I see why this should effect the calculation? – yossale Aug 25 '12 at 12:28
  • 3
    I'm guessing that your question is about the nature & calculation of the Gini index, & not about how to implement that in SQL (correct me if I'm wrong). If the latter, we should migrate this to SO. Continuing w/ my assumption, I have copied your code from the SE data site, but it might help if you can also rewrite it in pseudo-code for those who may not read SQL well. – gung - Reinstate Monica Aug 25 '12 at 13:36
  • @gung thanks - I do ask about the calculation, not the SQL implementation. I'll re write it in pseudo code – yossale Aug 25 '12 at 14:56

4 Answers4

12

I can't read the SQL code very easily, but if it helps, if I were going to calculate the Gini coefficient, this is what I would do (in plain English).

  1. Figure out the $n$ of $x$ (ie. the number of people with rep on SO)
  2. Sort $x$ from lowest to highest
  3. Sum each $x$ multiplied by its order in the rank (ie. if there are 10 people, the rep for the person with the lowest rep gets multiplied by 1 and the rep of the person with the highest rep gets multiplied by 10)
  4. Take that value and divide it by the product of $n$ and the sum of $x$ (ie. $n \times \sum $ rep) and then multiply that result by 2
  5. Take that result and subtract the value of $1-(1/n)$ from it.
  6. Voila!

I took those steps from the remarkably straight-forward code in the R function (in the ineq package) for calculating the Gini coefficient. For the record, here's that code:

> ineq::Gini
function (x) 
{
    n <- length(x)
    x <- sort(x)
    G <- sum(x * 1:n)
    G <- 2 * G/(n * sum(x))
    G - 1 - (1/n)
}
<environment: namespace:ineq>

It looks somewhat similar to your SQL code, but like I said, I can't really read that very easily!

chl
  • 50,972
  • 18
  • 205
  • 364
smillig
  • 2,336
  • 28
  • 31
  • Thanks you *very* much! I missed the sorting part! that explains a lot... – yossale Aug 25 '12 at 15:21
  • Super. I'm interested in knowing what the value is so maybe leave a comment when you've made the calculation! – smillig Aug 25 '12 at 16:04
  • Well, When I aggregated the values (i.e if there are 10 people, with either 1,3, or 5 points, then i have just 3 ranks : 1:3,2:5,3:10) and multiplied the (how many with that score)*score*(rank of score) I got -0.98 , which would have made sense if not for the wrong sign. But I'm not sure how my little shortcut effects the gini scale – yossale Aug 25 '12 at 16:16
  • Would you not have to assign the average score? I.e. for 1:3 apply $3 \times 2$, for 2:5 apply $4 \times 3.5$ etc.? Or did you do that? – Björn May 07 '16 at 05:42
4

There are, I believe, four equivalent formulations of the Gini index. To me, the most natural one is a U-statistic: $$ G = \frac 2{\mu n(n-1)}\sum_{i\neq j} |x_i - x_j| $$ where $\mu$ is the mean of $x$'s. You can double-check your computations with this formula. Obviously, the result must be non-negative. For what I know about Gini indices, the reputation distribution on CV should have the Gini index above 0.9; whether 0.98 makes a lot of sense or not, I can't say though.

StasK
  • 29,235
  • 2
  • 80
  • 165
1

Adding to @smillig answer, based on the provided equation:

SELECT something AS x into #t FROM sometable
SELECT *,ROW_NUMBER() OVER(ORDER BY x) AS i INTO #tt FROM #t
SELECT 2.0*SUM(x*i)/(COUNT(x)*SUM(x))-1.0-(1.0/COUNT(x)) AS gini FROM #tt

Gave me on my test set:

0.45503253636587840

Which is the same as R's ineq libraries Gini(x)

Chris
  • 1,151
  • 9
  • 26
  • ;WITH t AS (SELECT CAST(income AS FLOAT) AS x FROM #data), tt AS (SELECT *,ROW_NUMBER() OVER(ORDER BY x) AS i FROM t) SELECT 2.0*SUM(x * i)/(COUNT(x) * SUM(x))-1.0-(1.0/COUNT(x)) AS gini FROM tt – Chris Sep 22 '16 at 21:41
1

Here is how you can calculate it with SQL:

with balances as (
    select '2018-01-01' as date, balance
    from unnest([1,2,3,4,5]) as balance -- Gini coef: 0.2666666666666667
    union all
    select '2018-01-02' as date, balance
    from unnest([3,3,3,3]) as balance -- Gini coef: 0.0
    union all
    select '2018-01-03' as date, balance
    from unnest([4,5,1,8,6,45,67,1,4,11]) as balance -- Gini coef: 0.625
),
ranked_balances as (
    select date, balance, row_number() over (partition by date order by balance desc) as rank
    from balances
)
SELECT date, 
    -- (1 − 2B) https://en.wikipedia.org/wiki/Gini_coefficient
    1 - 2 * sum((balance * (rank - 1) + balance / 2)) / count(*) / sum(balance) AS gini
FROM ranked_balances
GROUP BY date
ORDER BY date ASC
-- verify here http://shlegeris.com/gini

Explanation is here https://medium.com/@medvedev1088/calculating-gini-coefficient-in-bigquery-3bc162c82168

medvedev1088
  • 126
  • 2