4

The question below was asked on a sister site (Stack Overflow) back in 2010 by a user still active there (to me it seems more suitable here, for example quite similar to 21422):

I have a bunch of data in Excel that I need to get certain percentile  
information from. The problem is that instead of having the data set made up of  
each value, I instead have info on the number of or "bucket" data.
For example, imagine that my actual data set looks like this:  
1,1,2,2,2,2,3,3,4,4,4  

The data set that I have is this:
Value    No. of occurrences
  1              2
  2              4
  3              2
  4              3
Is there an easy way for me to calculate percentile information (as well as the  
median) without having to explode the summary data out to full data set? (Once I  
did that, I know that I could just use the Percentile(A1:A5, p) function)  

This is important because my data set is very large. If I exploded the data out,  
I would have hundreds of thousands of rows and I would have to do it for a  
couple of hundred data sets.

I doubt after all this time the OP is still concerned but if not off topic here I would appreciate your views (but won’t be upset if this is closed!)

I am interested because (from a very long time ago!) I thought I had learned that, once binned, key information for accurate calculation of percentiles was irretrievably lost. But this is much more more my curiosity than a pressing need.

pnuts
  • 157
  • 1
  • 7
  • 1
    If the bins are labelled '1', '2' and so on, this is now discrete data; and the counts are in effect giving a histogram. You can work out which quantiles correspond to which bins easily enough - just cumulate the counts and divide by the last count, and you have what is in effect the entire ECDF. – Glen_b Jul 27 '13 at 00:59
  • @pnuts, I am wondering if you were able to implement this in Excel somehow, every solution I come up is more convoluted than the other :) – Jeanno Mar 23 '15 at 12:56

2 Answers2

8

It is quite easy actually.

Let's say the sum of the counts is N, and you that you want the 0.3 (30%) bottom percentile. This means the threshold value will occur after 0.3*N counts.

Now you look at the cumulative distribution, and when it reaches 0.3*N, you have the value. It is very easy to implement.

For example, you had this:

Value    No. of occurrences
  1              2
  2              4
  3              2
  4              3

So you convert it into the cumulative:

Value    No. of occurrences
 <=1              2
 <=2              6
 <=3              8
 <=4              11

Here N=11 so 0.3*11=3.3. This happens in the second bin, so the 0.3 percentile is 2.

Bitwise
  • 6,379
  • 2
  • 22
  • 27
  • 1
    Hi Bitwise, I have read your answer several times to make sure I am not crazy. But isn't the 30th percentile 2 in this case? – Jeanno Mar 23 '15 at 12:55
  • @Jeanno of course, nice catch! Corrected. – Bitwise Mar 23 '15 at 13:12
  • thanks for the follow up :), I am struggling to implement this in Excel, I am wondering if you have some ideas ? – Jeanno Mar 23 '15 at 13:15
  • @Bitwise, thanks again for your response. I was able to put together a little VBA code to compute the percentile using binned data. To answer your question, the function PERCENTILE does not work for binned data, you basically have a column of all your data (doesn't have to be sorted) – Jeanno Mar 23 '15 at 14:03
0

PercentileBinnedData is an implementation of the algorithm described above that I have developed, QuickSort will ensure that your binned data is sorted in increasing order

Function PercentileBinnedData(rng As Range, percentile As Double) As Double
    Dim v As Variant
    v = rng.Value
    QuickSortArray v, , , 1 ' sort 2D array
    Dim i As Long
    Dim totalOccurences As Long

    'Convert to cumalative distribution
    For i = LBound(v, 1) To UBound(v, 1)
        If i < UBound(v, 1) Then
            v(i + 1, 2) = v(i + 1, 2) + v(i, 2)
        End If
    Next i
    totalOccurences = v(UBound(v, 1), 2) ' number of occurences is equal to last number of occurences
    Dim rank As Double: rank = percentile * totalOccurences
    For i = LBound(v, 1) To UBound(v, 1)
        If i = LBound(v, 1) And rank <= v(i, 2) Then
            PercentileBinnedData = v(i, 1)
        End If
        If i > LBound(v, 1) Then
            If rank > v(i - 1, 2) And rank <= v(i, 2) Then
                PercentileBinnedData = v(i, 1)
            End If
        End If
    Next i
End Function

Public Sub QuickSortArray(ByRef SortArray As Variant, Optional lngMin As Long = -1, Optional lngMax As Long = -1, Optional lngColumn As Long = 0)
    On Error Resume Next

    'Sort a 2-Dimensional array

    ' SampleUsage: sort arrData by the contents of column 3
    '
    '   QuickSortArray arrData, , , 3

    '
    'Posted by Jim Rech 10/20/98 Excel.Programming

    'Modifications, Nigel Heffernan:

    '       ' Escape failed comparison with empty variant
    '       ' Defensive coding: check inputs

    Dim i As Long
    Dim j As Long
    Dim varMid As Variant
    Dim arrRowTemp As Variant
    Dim lngColTemp As Long

    If IsEmpty(SortArray) Then
        Exit Sub
    End If
    If InStr(TypeName(SortArray), "()") < 1 Then  'IsArray() is somewhat broken: Look for brackets in the type name
        Exit Sub
    End If
    If lngMin = -1 Then
        lngMin = LBound(SortArray, 1)
    End If
    If lngMax = -1 Then
        lngMax = UBound(SortArray, 1)
    End If
    If lngMin >= lngMax Then    ' no sorting required
        Exit Sub
    End If

    i = lngMin
    j = lngMax

    varMid = Empty
    varMid = SortArray((lngMin + lngMax) \ 2, lngColumn)

    ' We  send 'Empty' and invalid data items to the end of the list:
    If IsObject(varMid) Then  ' note that we don't check isObject(SortArray(n)) - varMid *might* pick up a valid default member or property
        i = lngMax
        j = lngMin
    ElseIf IsEmpty(varMid) Then
        i = lngMax
        j = lngMin
    ElseIf IsNull(varMid) Then
        i = lngMax
        j = lngMin
    ElseIf varMid = "" Then
        i = lngMax
        j = lngMin
    ElseIf VarType(varMid) = vbError Then
        i = lngMax
        j = lngMin
    ElseIf VarType(varMid) > 17 Then
        i = lngMax
        j = lngMin
    End If

    While i <= j
        While SortArray(i, lngColumn) < varMid And i < lngMax
            i = i + 1
        Wend
        While varMid < SortArray(j, lngColumn) And j > lngMin
            j = j - 1
        Wend

        If i <= j Then
            ' Swap the rows
            ReDim arrRowTemp(LBound(SortArray, 2) To UBound(SortArray, 2))
            For lngColTemp = LBound(SortArray, 2) To UBound(SortArray, 2)
                arrRowTemp(lngColTemp) = SortArray(i, lngColTemp)
                SortArray(i, lngColTemp) = SortArray(j, lngColTemp)
                SortArray(j, lngColTemp) = arrRowTemp(lngColTemp)
            Next lngColTemp
            Erase arrRowTemp

            i = i + 1
            j = j - 1
        End If
    Wend

    If (lngMin < j) Then Call QuickSortArray(SortArray, lngMin, j, lngColumn)
    If (i < lngMax) Then Call QuickSortArray(SortArray, i, lngMax, lngColumn)

End Sub
Jeanno
  • 101
  • 2