How do you find quantiles in this balls-in-bins problem?

Question

I need to calculate the expected number of hash collisions with a range for a software project. I think this is a reformulation of the birthday problem, as follows.

Suppose you have $n$ balls allocated at random to $d$ bins. What is the greatest value $m$ such that the probability of getting at least $m$ balls in the same bin exceeds one-half?

How do you conceive of this as differing from the Birthday Problem? As far as I can tell, the situation is identical. — whuber, May 11 '21 at 20:50
@whuber the birthday problem asks 'the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday.', this question is - 'given a set of individuals, how many people will have the same birthday with probability > 50' — significance, May 15 '21 at 11:10
Presently unclear what you're asking: please try to reframe your problem more clearly ("max number of numbers which will cause bucket collisions" is particularly unclear). — Ben, May 15 '21 at 11:23
@Ben thanks for your response - i have tried to improve it, hope that is better! — significance, May 15 '21 at 12:21
I am reading what appear to be two separate questions and both of them are so vague that I can find several distinct interpretations. Could you perhaps provide some examples of the event you are trying to describe? — whuber, May 15 '21 at 12:51
Are you asking, "for fixed $n$ and $d$, what is the greatest $m$ such that if $n$ people are chosen at random, the probability that it is possible to find $m$ individuals among them with the same birthday exceeds 0.5"? — fblundun, May 15 '21 at 13:01
We have some good threads related to the underlying distribution in this problem--they are probably worth consulting. [Here's the search link.](https://stats.stackexchange.com/search?tab=votes&q=birthday%20distribution) — whuber, May 15 '21 at 13:12
also, many thanks for bearing with me here, i had reused variable names and my original question wasn't clear at all! — significance, May 15 '21 at 13:29
I have made a major edit to the question to simplify the problem statement and frame it more generally in terms of balls/bins. Please check to ensure that the problem statement is an accurate statement of the problem of interest to you. — Ben, May 16 '21 at 10:28

Ben · Accepted Answer · 2021-08-01T23:16:20.863

In order to be as responsive as possible to the applied scenario of interest to you, I am going to frame my analysis in terms of data objects that take on a random hash value, so that objects with the same hash value give hash collisions. The analysis is no different if you use balls/bins instead of data objects/hash values, so please interpret with whatever applied context you wish.

Your problem here involves an analysis of the maximum count from a uniform-multinomial distribution; it can be framed mathematically as follows. Let $X_1,...,X_n \sim \text{IID U} \{ 1,...,d \}$ be a set of $n$ uniform draws from $d$ hash values. Define the count values $N_r = \sum_{i=1}^n \mathbb{I}(X_i = r)$ for each hash value and observe that the count vector has a uniform-multinomial distribution:

$$\mathbf{N} = (N_1,...,N_d) \sim \text{Mu} \Big( n, (\tfrac{1}{d},...,\tfrac{1}{d}) \Big).$$

Define the largest count over a single hash value as $M_n \equiv \max (N_1,...,N_d)$. You are looking for the largest value $m$ such that $\mathbb{P}(M_n \geqslant m) \geqslant 0.5$, which is a quantile of the distribution of $M_n$.

Unfortunately, the distribution of the uniform-multinomial maximum-count is a complicated distribution that does not have a simple closed form. An iterative method for computing the distribution of the maximum is examined in Bonetti, Cirillo and Ogay (2019) (pp. 6-7). It is programmed in R in the occupancy package using standard syntax for probability functions (see related question here). Here I will give an example with $n=20$ and $d = 6$, but you can alter these numbers if you prefer.

#Set the parameters
n <- 20
d <- 6

#Compute and plot the cumulative probabilities from the MaxCount distribution
library(occupancy)
CUMPROBS <- pmaxcount(0:n, size = n, space = d)
names(CUMPROBS ) <- sprintf('M[%s]', 0:n)
plot(0:n, CUMPROBS, col = 'blue', pch = 20, cex = 1.5,
     xlab = 'Maximum Count', ylab = 'Cumulative Probability')
abline(a = 0.5, b = 0, lty = 2, lwd = 2)

#Compute quantile
QUANT <- qmaxcount(0.5, size = n, space = d)
QUANT

[1] 6

In order to get $\mathbb{P}(M_n \geqslant m) \geqslant 0.5$ we can solve the equivalent problem $\mathbb{P}(M_n \leqslant m-1) \leqslant 0.5$. As you can see from the above plot of the cumulative distribution function, the largest value of $m$ that satisfies this requirement is $m=6$. In other words, if we have $n=20$ pieces of data distributed randomly over $d=6$ distinct hash values, there is at least a 50% chance that the maximum number of pieces of data with the same hash is at least six.

honestly have not had opportunity to use or verify this yet but seems great, thanks! — significance, May 19 '21 at 15:20

Manuel · Answer 2 · 2021-05-15T15:08:56.213

If you are willing to work with estimations of the probabilities there is a rather simple solution with no need of calculation. Here is some R code you can play with.

birthdays = function(n, d){ # Asigns bds to n people
  sample(1:d, n, replace = TRUE )
}

collisions = function(bds, m){ #Check if there are m or more people who share birthda
  any(table(bds)>= m)
}

estimate_prob =function(n,d,m) { 
  mean( replicate(2000, collisions(birthdays(n,d), m)) ) # we generate 2000 samples of bds anc calculate proportion of those where ther are m or more people sharing bday. You can change number o samples to a greater number for more acuracy in estimation.
}

The code is in R. The function birthdays samples a birthday in a year of d days for a total of n people. Collisions returns TRUE if and only if m or more share the same bday. The function estimate_pro estimates the probability of m colissions between n people and a d days year. You can check that for $d = 365$ and $n = 23$ the probability of sharing two people sharing birthday is aproximately $0.5$.

estimate_prob(23, 365, 2)

Then you can use this function and for given $n$ and $d$ estimate the probability of m collisions on a grid. Here is an example

prob = rep(0,10)

for( m in 1:10) {
  prob[m] = estimate_prob(100, 365, m)
}
plot(1:10, prob, xlab = "m")
abline(0.5, 0, col = "red")

This gives the following plot which shows that for $d = 365$ and $n = 100$ the probability of three or more collisions is about $0.6$ and four or more is around $0.05$

How do you find quantiles in this balls-in-bins problem?

2 Answers2