Note: This was first posted on StackOverflow, but I did not have much luck there.
I have a very large database. Each item in the database is assigned a random value drawn uniformly from $[0, 1)$. The database is so large that performing a `COUNT` on a query (i.e., determining how many rows the query returns) is very slow. However, I'd like to use the random values to estimate the size of a query's result set (its `COUNT`).
The idea is this:
- Run the query and order its results by the random value. The random values are indexed, so this is fast.
- Grab the lowest $N$ values and note the largest of them, say $R$. Since the values are uniform on $[0, 1)$, roughly a fraction $R$ of all matching rows should fall below $R$, so estimate `COUNT` as $N / R$. (A sketch of this procedure follows the list.)
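
For concreteness, here is a minimal sketch of the procedure, assuming a SQLite table with an indexed column of uniform random values. The table and column names (`items`, `rand_val`) are placeholders, and in the real setting the `SELECT` would also carry the query's `WHERE` clause; here the whole table stands in for the query's result set.

```python
# Minimal sketch: estimate COUNT from the N smallest indexed random values.
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, rand_val REAL)")
conn.execute("CREATE INDEX idx_rand ON items (rand_val)")

true_count = 100_000  # stand-in for the unknown size of the result set
conn.executemany(
    "INSERT INTO items (rand_val) VALUES (?)",
    ((random.random(),) for _ in range(true_count)),
)
conn.commit()

N = 1000  # number of smallest random values to fetch
# Ordered scan over the indexed random column; only N rows are read.
rows = conn.execute(
    "SELECT rand_val FROM items ORDER BY rand_val LIMIT ?", (N,)
).fetchall()
R = rows[-1][0]   # largest of the N smallest values
estimate = N / R  # the proposed estimator

print(f"true COUNT = {true_count}, estimate = {estimate:.0f}")
```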
The question is two-part:
- Is $N / R$ the best way to estimate `COUNT`? Maybe it should be $(N+1)/R$? Could we also look at the other observed values (mean, variance, etc.), and not just the largest one, to get a better estimate?
- What is the error margin on this estimated value of `COUNT`?
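
One way to probe both parts empirically is a quick Monte Carlo simulation like the one below; the true count, $N$, and number of trials are arbitrary choices, not values from my actual database.

```python
# Rough simulation: compare the two candidate estimators and measure the
# spread of their errors. All sizes below are arbitrary assumptions.
import heapq
import random
import statistics

true_count = 50_000  # pretend this is the unknown COUNT
N = 1000             # number of smallest random values inspected
trials = 200

est_n, est_n1 = [], []
for _ in range(trials):
    # R is the N-th smallest of true_count uniform random values.
    R = heapq.nsmallest(N, (random.random() for _ in range(true_count)))[-1]
    est_n.append(N / R)
    est_n1.append((N + 1) / R)

for name, est in (("N/R", est_n), ("(N+1)/R", est_n1)):
    print(f"{name:>8}: mean = {statistics.mean(est):,.0f}, "
          f"std dev = {statistics.stdev(est):,.0f}")
```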