I'm not a statistics expert, so I'm hoping someone here can lend me a hand.
I've got a bunch of key-value pairs, each associated with a specific time value. Something like this:
<time>|<key>|<value>
0     |1    |34534
0     |2    |23434
0     |3    |4606
1     |1    |945954
1     |6    |459459
1     |8    |34
There will be tens of millions of key-value pairs spread over 24 unique time values. I need to efficiently calculate the "Top 10" sums of values, grouped by key, across all time values.
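To make the goal concrete, here's a minimal sketch of the calculation I'm after, assuming the rows are available as (time, key, value) tuples (top_n_overall and the sample data are just made up for this post):

from collections import defaultdict
import heapq

def top_n_overall(rows, n=10):
    # Sum values per key across every time value.
    sums = defaultdict(int)
    for _time, key, value in rows:
        sums[key] += value
    # Return the n keys with the largest totals.
    return heapq.nlargest(n, sums.items(), key=lambda kv: kv[1])

rows = [
    (0, 1, 34534), (0, 2, 23434), (0, 3, 4606),
    (1, 1, 945954), (1, 6, 459459), (1, 8, 34),
]
print(top_n_overall(rows, n=2))  # [(1, 980488), (6, 459459)]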
I don't think I'll be able to do this efficiently unless I split the problem, so my plan is to split the calculation across each individual time value.
In the example above, that would give me two jobs:
Job1
<time>|<key>|<value>
0     |1    |34534
0     |2    |23434
0     |3    |4606
Job2
<time>|<key>|<value>
1     |1    |945954
1     |6    |459459
1     |8    |34
Now imagine that I have millions of key-value pairs for each time value. For each job I calculate, say, the Top 1000 per-key sums, ordered by value in descending order. I then aggregate the Top 1000 results from all of the jobs to calculate the "Top 10" for the full time range, roughly like the sketch below.
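Here's a rough sketch of what I mean, again assuming (time, key, value) tuples; per_job_top, merge_top, and the k=1000 cutoff are just names and numbers I made up to illustrate the idea:

from collections import defaultdict
import heapq

def per_job_top(job_rows, k=1000):
    # One job = all rows for a single time value.
    # Sum per key within the job, then keep only the k largest partial sums.
    sums = defaultdict(int)
    for _time, key, value in job_rows:
        sums[key] += value
    return heapq.nlargest(k, sums.items(), key=lambda kv: kv[1])

def merge_top(job_results, n=10):
    # Add up the partial sums that survived each job's cutoff,
    # then take the overall Top n.
    totals = defaultdict(int)
    for partial in job_results:
        for key, subtotal in partial:
            totals[key] += subtotal
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

job1 = [(0, 1, 34534), (0, 2, 23434), (0, 3, 4606)]
job2 = [(1, 1, 945954), (1, 6, 459459), (1, 8, 34)]
top10 = merge_top([per_job_top(job1), per_job_top(job2)], n=10)

My worry is that a key could fall just below the cutoff in every job and still have a large overall total, so it would never reach merge_top.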
I'm using 1000 as an example here, but is there a more precise per-job cutoff that will guarantee my "Top 10" is correct?
I'm not even quite sure where to start with this.
Vinbot