5

I need to calculate the expected number of hash collisions with a range for a software project. I think this is a reformulation of the birthday problem, as follows.

Suppose you have $n$ balls allocated at random to $d$ bins. What is the greatest value $m$ such that the probability of getting at least $m$ balls in the same bin exceeds one-half?

Ben
  • 91,027
  • 3
  • 150
  • 376
significance
  • 103
  • 5
  • 4
    How do you conceive of this as differing from the Birthday Problem? As far as I can tell, the situation is identical. – whuber May 11 '21 at 20:50
  • @whuber the birthday problem asks 'the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday.', this question is - 'given a set of individuals, how many people will have the same birthday with probability > 50' – significance May 15 '21 at 11:10
  • 1
    Presently unclear what you're asking: please try to reframe your problem more clearly ("max number of numbers which will cause bucket collisions" is particularly unclear). – Ben May 15 '21 at 11:23
  • @Ben thanks for your response - i have tried to improve it, hope that is better! – significance May 15 '21 at 12:21
  • I am reading what appear to be two separate questions and both of them are so vague that I can find several distinct interpretations. Could you perhaps provide some examples of the event you are trying to describe? – whuber May 15 '21 at 12:51
  • 1
    Are you asking, "for fixed $n$ and $d$, what is the greatest $m$ such that if $n$ people are chosen at random, the probability that it is possible to find $m$ individuals among them with the same birthday exceeds 0.5"? – fblundun May 15 '21 at 13:01
  • 1
    We have some good threads related to the underlying distribution in this problem--they are probably worth consulting. [Here's the search link.](https://stats.stackexchange.com/search?tab=votes&q=birthday%20distribution) – whuber May 15 '21 at 13:12
  • thanks @fblundun yes that's it, have edited – significance May 15 '21 at 13:25
  • 1
    @whuber thanks, taking a look :D – significance May 15 '21 at 13:25
  • also, many thanks for bearing with me here, i had reused variable names and my original question wasn't clear at all! – significance May 15 '21 at 13:29
  • I have made a major edit to the question to simplify the problem statement and frame it more generally in terms of balls/bins. Please check to ensure that the problem statement is an accurate statement of the problem of interest to you. – Ben May 16 '21 at 10:28

2 Answers2

4

In order to be as responsive as possible to the applied scenario of interest to you, I am going to frame my analysis in terms of data objects that take on a random hash value, so that objects with the same hash value give hash collisions. The analysis is no different if you use balls/bins instead of data objects/hash values, so please interpret with whatever applied context you wish.

Your problem here involves an analysis of the maximum count from a uniform-multinomial distribution; it can be framed mathematically as follows. Let $X_1,...,X_n \sim \text{IID U} \{ 1,...,d \}$ be a set of $n$ uniform draws from $d$ hash values. Define the count values $N_r = \sum_{i=1}^n \mathbb{I}(X_i = r)$ for each hash value and observe that the count vector has a uniform-multinomial distribution:

$$\mathbf{N} = (N_1,...,N_d) \sim \text{Mu} \Big( n, (\tfrac{1}{d},...,\tfrac{1}{d}) \Big).$$

Define the largest count over a single hash value as $M_n \equiv \max (N_1,...,N_d)$. You are looking for the largest value $m$ such that $\mathbb{P}(M_n \geqslant m) \geqslant 0.5$, which is a quantile of the distribution of $M_n$.

Unfortunately, the distribution of the uniform-multinomial maximum-count is a complicated distribution that does not have a simple closed form. An iterative method for computing the distribution of the maximum is examined in Bonetti, Cirillo and Ogay (2019) (pp. 6-7). It is programmed in R in the occupancy package using standard syntax for probability functions (see related question here). Here I will give an example with $n=20$ and $d = 6$, but you can alter these numbers if you prefer.

#Set the parameters
n <- 20
d <- 6

#Compute and plot the cumulative probabilities from the MaxCount distribution
library(occupancy)
CUMPROBS <- pmaxcount(0:n, size = n, space = d)
names(CUMPROBS ) <- sprintf('M[%s]', 0:n)
plot(0:n, CUMPROBS, col = 'blue', pch = 20, cex = 1.5,
     xlab = 'Maximum Count', ylab = 'Cumulative Probability')
abline(a = 0.5, b = 0, lty = 2, lwd = 2)

#Compute quantile
QUANT <- qmaxcount(0.5, size = n, space = d)
QUANT

[1] 6

enter image description here

In order to get $\mathbb{P}(M_n \geqslant m) \geqslant 0.5$ we can solve the equivalent problem $\mathbb{P}(M_n \leqslant m-1) \leqslant 0.5$. As you can see from the above plot of the cumulative distribution function, the largest value of $m$ that satisfies this requirement is $m=6$. In other words, if we have $n=20$ pieces of data distributed randomly over $d=6$ distinct hash values, there is at least a 50% chance that the maximum number of pieces of data with the same hash is at least six.

Ben
  • 91,027
  • 3
  • 150
  • 376
1

If you are willing to work with estimations of the probabilities there is a rather simple solution with no need of calculation. Here is some R code you can play with.

birthdays = function(n, d){ # Asigns bds to n people
  sample(1:d, n, replace = TRUE )
}

collisions = function(bds, m){ #Check if there are m or more people who share birthda
  any(table(bds)>= m)
}

estimate_prob =function(n,d,m) { 
  mean( replicate(2000, collisions(birthdays(n,d), m)) ) # we generate 2000 samples of bds anc calculate proportion of those where ther are m or more people sharing bday. You can change number o samples to a greater number for more acuracy in estimation.
}

The code is in R. The function birthdays samples a birthday in a year of d days for a total of n people. Collisions returns TRUE if and only if m or more share the same bday. The function estimate_pro estimates the probability of m colissions between n people and a d days year. You can check that for $d = 365$ and $n = 23$ the probability of sharing two people sharing birthday is aproximately $0.5$.

estimate_prob(23, 365, 2)

Then you can use this function and for given $n$ and $d$ estimate the probability of m collisions on a grid. Here is an example

prob = rep(0,10)

for( m in 1:10) {
  prob[m] = estimate_prob(100, 365, m)
}
plot(1:10, prob, xlab = "m")
abline(0.5, 0, col = "red")

This gives the following plot which shows that for $d = 365$ and $n = 100$ the probability of three or more collisions is about $0.6$ and four or more is around $0.05$

enter image description here

Manuel
  • 1,517
  • 12
  • 19