
I'm not a statistics expert, so I'm hoping someone here can lend me a hand.

I've got a bunch of key-value pairs associated with a specific time range. Something like this:

<time>|<key>|<value>
0     |1    |34534
0     |2    |23434
0     |3    |4606
1     |1    |945954
1     |6    |459459
1     |8    |34

There will be tens of millions of key-value pairs spread across the time values (24 unique values). I need to be able to efficiently calculate the "Top 10": group by key, sum the values across all time values, and take the ten keys with the largest sums.
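
To make this concrete, here is a minimal Python sketch of the exact computation I'm after (rows is a hypothetical iterable of (time, key, value) tuples; it obviously won't scale to the full data, which is the problem):

    from collections import defaultdict

    # Exact, unsplit computation: sum values per key across all time
    # values, then keep the ten keys with the largest totals.
    def exact_top10(rows):
        totals = defaultdict(int)
        for _, key, value in rows:
            totals[key] += value
        return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]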

I don't think I'll be able to do this efficiently unless I split the problem, so my theory is that I can split the calculation across the individual time values.

In the example above, that would give me two jobs:

Job1
<time>|<key>|<value>
0     |1    |34534
0     |2    |23434
0     |3    |4606

Job2
<time>|<key>|<value>
1     |1    |945954
1     |6    |459459
1     |8    |34

Now imagine that I have millions of key-value pairs for each time value. For each job I calculate, say, the Top 1000 per-key sums, ordered by value in descending order. I then aggregate all of the Top 1000 lists together to calculate the "Top 10" over the full time range (sketched in code below).
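
In Python terms, the scheme I have in mind looks roughly like this (the function names are made up, N = 1000 in my example, and the merge step is exactly the part whose correctness I'm unsure about):

    import heapq
    from collections import defaultdict

    # Per-job step: sum values by key within one time value, keep Top N.
    def top_n_for_time(rows_for_one_time, n):
        totals = defaultdict(int)
        for _, key, value in rows_for_one_time:
            totals[key] += value
        return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

    # Aggregation step: re-sum the surviving keys and keep the Top 10.
    # Note: a key that fell just below the cutoff in several jobs is
    # lost entirely, so this is only guaranteed exact if N is "large
    # enough" -- hence my question.
    def merge_top10(per_time_top_n_lists):
        merged = defaultdict(int)
        for partial in per_time_top_n_lists:
            for key, subtotal in partial:
                merged[key] += subtotal
        return heapq.nlargest(10, merged.items(), key=lambda kv: kv[1])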

I'm using 1000 as an example here, but what value would actually guarantee that my "Top 10" is correct?

I'm not even quite sure where to start with this.

Vinbot
  • So that everyone will understand you correctly, could you please include an explanation of what you mean by "calculating the top Y" or the "top X"? Also, what do you mean by "against specific timestamps"? – whuber Aug 11 '14 at 17:09
  • Sure, so what I meant was: calculate the "Top Y" for half the time values, where say Y is 1000, and calculate the "Top Y" for the other half of the time values, likewise with Y = 1000. Basically, split the task in half. Then, given those two sets of "Top 1000" values, combine them together by key and get a "Top 10" of the union. I'm wondering what Y needs to be (1000 in my examples) to guarantee that the "Top 10" generated from the combination of these two data sets matches the "Top 10" I would get if I did NOT split the work in half and instead generated the "Top 10" from the raw data. – Vinbot Aug 11 '14 at 18:34
  • I'm sorry, I haven't any clue what you mean by "Y" or even "top Y". How are "X" and "Y" related to your (time, key, value) triples? What does "top" mean? – whuber Aug 11 '14 at 20:41
  • This is a big-data question. Google "approximate top-n queries"; e.g., see Druid, the open-source columnar database. The question is how to find the top-n items in parallel, where the data (e.g., transactions) are distributed over hundreds of machines. Rather than exhaustively sorting on each machine (say a billion possible items) and then merging the lists from the 100 machines only to select the top 10 items, he wants to merge in only the top 1000 items from each machine in order to get an approximate top 10 (of items sold, etc.) – seanv507 Aug 11 '14 at 22:27
  • @seanv507 thanks, great summary. However, this isn't necessarily multiple machines, but it is distributed. – Vinbot Aug 11 '14 at 23:15
  • http://en.m.wikipedia.org/wiki/Selection_algorithm. Lingo: "top-n" in databases = k-th order statistics in statistics. I think your question might be better asked on stackoverflow, as I assume you are looking for an implementation for your specific situation/constraints rather than the general theory of the sampling distribution of k-th order statistics. – seanv507 Aug 12 '14 at 07:08
  • @seanv507 actually, the implementation I can handle straightforwardly. The statistical aspect is what I'm struggling with. – Vinbot Aug 12 '14 at 11:50
  • @sean I appreciate the translation of the jargon, but the situation is still vague. Top *what*? Keys? Values? Counts of keys? Counts of values? Sums of values by key? This question has to be edited so that it has some chance of being interpreted as intended by a majority of readers on this site, for otherwise it can accumulate responses that vary--and perhaps conflict with one another--due to these ambiguities. – whuber Aug 12 '14 at 13:24
  • I've edited the question to try to make it clearer – Vinbot Aug 12 '14 at 14:46
  • Thank you for improving the question. It sounds like you might be looking for the same technology involved in "online" or "dynamic" estimation of quantiles (a "top 10" or "top 1000" can easily be expressed as a quantile). If so, that gives you a choice of many solutions as described at http://stats.stackexchange.com/questions/7959, http://stats.stackexchange.com/questions/3372, http://stats.stackexchange.com/questions/346, and http://stats.stackexchange.com/questions/103258. Do any of these help? – whuber Aug 12 '14 at 16:20
  • I'll try to digest the suggestions. The first one looks close to what I'd want, but I need to spend some time understanding the math first. – Vinbot Aug 12 '14 at 18:26

0 Answers