
Note: This was first posted on StackOverflow, but I did not have much luck there.

I have a very large database. Each item in the database is assigned a uniformly distributed random value in $[0, 1)$. The database is so large that performing a COUNT on a query (i.e., determining how many results the query returns) is very slow. However, I'd like to use the random values to estimate the query's total size (its COUNT).

The idea is this:

  1. Run the query and order the results by the random value. The random values are indexed, so this is fast.
  2. Take the $N$ smallest values, and let $R$ be the largest of them. Estimate COUNT as $N/R$.
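As a rough sketch of the procedure (simulation only: the item count, seed, and function name here are illustrative, and in a real system the $N$ smallest values would come from an indexed `ORDER BY random_value LIMIT N` query rather than an in-memory sort):

```python
import random

random.seed(42)  # for reproducibility of this toy example

# Toy stand-in for the database: each item carries a uniform random value in [0, 1).
TRUE_COUNT = 1_000_000
database = [random.random() for _ in range(TRUE_COUNT)]

def estimate_count(values, n):
    """Estimate the total number of items from the n smallest random values.

    The sort here stands in for an indexed query; only the n-th smallest
    value R is actually needed.
    """
    smallest = sorted(values)[:n]
    r = smallest[-1]  # R: the largest of the n smallest values
    return n / r

estimate = estimate_count(database, 1000)
```

With $N = 1000$ out of a million items, the estimate typically lands within a few percent of the true count.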

The question is two-part:

  1. Is $N/R$ the best way to estimate COUNT? Should it perhaps be $(N+1)/R$? Could we get a better estimate by looking at the other values (mean, variance, etc.) rather than just the largest one?
  2. What is the error margin on this estimated value of COUNT?
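For an empirical feel for both questions, a quick Monte Carlo experiment (simulation only; the parameter values are arbitrary) compares $N/R$ against the neighbouring candidates $(N-1)/R$ and $(N+1)/R$, reporting the mean error (bias) and spread of each:

```python
import random
import statistics

random.seed(1)
TRUE_COUNT, N, TRIALS = 10_000, 100, 300

errors = {"N/R": [], "(N-1)/R": [], "(N+1)/R": []}
for _ in range(TRIALS):
    values = [random.random() for _ in range(TRUE_COUNT)]
    r = sorted(values)[N - 1]  # R: the N-th smallest random value
    errors["N/R"].append(N / r - TRUE_COUNT)
    errors["(N-1)/R"].append((N - 1) / r - TRUE_COUNT)
    errors["(N+1)/R"].append((N + 1) / r - TRUE_COUNT)

for name, errs in errors.items():
    print(f"{name:10s} mean error: {statistics.mean(errs):9.1f}"
          f"  std: {statistics.stdev(errs):9.1f}")
```

This doesn't replace a derivation, but it makes the bias of each candidate and the scale of the error margin directly visible.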
– speedplane
    I'm curious: how can it happen that it's more efficient to report the smallest $N$ values rather than just the count of values? Is it that the query is exploiting an index on these random values in a way that enables it to find those $N$ values first without finding the rest? – whuber Jul 02 '19 at 17:41
    For those interested this has been posted on math stack exchange as well. https://math.stackexchange.com/questions/3280227/probability-estimating-database-size-with-n-smallest-random-values Relevant because a similar question to @whuber came up there. – Kitter Catter Jul 02 '19 at 17:42
    I hate to ask, but there is an answer to your question on StackOverflow. What is the problem with that answer? – Kitter Catter Jul 02 '19 at 17:48
  • @whuber In distributed systems that store massive amounts of data across many computers, with many writes/deletes per second, it is still relatively easy to add or remove an ordered item, but it is difficult to maintain an accurate count of all items. The issue only arises when the database grows so large that it has to span many computers across a network. An insertion affects only one node on the network, but maintaining a comprehensive count requires coordination across many nodes. – speedplane Jul 03 '19 at 06:29
  • To clarify slightly: an insertion requires coordination among O(log(COUNT)) nodes, while determining the COUNT requires coordination of O(COUNT) nodes, where "O" denotes algorithmic complexity. Databases that handle extremely large amounts of data (e.g., Google's BigTable) either don't support COUNT or incur large performance costs for it. – speedplane Jul 03 '19 at 06:42
