Finding top n across multiple sets

Question

Suppose we have a set of key value pairs, such that the keys can be ordered based on the value . We can easily extract and report the top $n$ keys. Now suppose the we havet S such sets. From each set we are supposed to extract a set of $p$ top keys and merge them to get the top $n$ keys from all the sets. How large should $p$ be? What is the relationship between the $S$, $n$ and $p$ and the error?

I wanted to know if there is any literature on this subject.

Clarification: Its possible to have the same keys in multiple sets. The merge and find top n in the ideal case would construct a combined map of key-value pairs (by adding all values corresponding to the same key) and finding the top n values.

In my data the distribution of values in each set and the combined set has a long tail.

Related questions: http://stats.stackexchange.com/questions/7959/algorithm-to-dynamically-monitor-quantiles, http://stats.stackexchange.com/questions/7251/how-to-make-a-combination-aggregation-of-quantile-forecast, http://stats.stackexchange.com/questions/14006/distributed-quantile-algorithm-with-duplicates. They all contain potentially useful information (even though not all are answered). — whuber, Dec 01 '11 at 16:01

score 2 · Answer 1 · answered Dec 01 '11 at 08:30

Well, if you have some knowledge on the relative distribution your sets are sampled from, interesting things can happen, and you can probably use formulas to calculate how many you need to draw and how large the probability becomes that you've drawn the top n. Note however that this will depend on the distribution of the values, even if you say each of the sets is sampled iid (I imagine there to be quite a difference in the result if you draw them from standard normal as opposed to a discrete distribution with only zeroes and ones).

For now: it is very well possible that one set holds all larger values than any other, so then you will have to draw n from each set to ensure you have the top n...

You may want to specify more details...

Finding top n across multiple sets

1 Answers1