
I have a big bag of balls, each one marked with a number between 0 and $n$. The same number may appear on more than one ball. We can assume that the numbers on the balls follow a binomial distribution.

Now I pick a random ball from the bag, read its number $x$ and put the ball back. Then I pick a second random ball from the bag, read its number $y$ and put it back.

I compute $z = |x - y|$. What is the distribution of $z$?

My calculations led me to the result that it's a chi distribution with one degree of freedom (or rather, its discrete equivalent). To obtain this result, I used the normal distribution instead of the binomial. I wonder whether this result is correct, and how it can be obtained without approximating the binomial with the normal.

– Likk
  • Related https://math.stackexchange.com/questions/562119/difference-of-two-binomial-random-variables and https://math.stackexchange.com/questions/1065487/difference-between-two-independent-binomial-random-variables-with-equal-success/1171684 – user2974951 Jan 23 '19 at 12:21
  • *"I have a big bag of balls, each one marked with a number between 1 and n"* So the distribution of the numbers on these balls (and how many balls there are in total) is not specified completely? Or do you just want to ask, in a very indirect way, the distribution of the difference of two independent and similar binomial distributed variables? (note: picking a ball from a bag with balls that have numbers which are binomial distributed, is not the same as picking independent binomial distributed variables, imagine for example the case of one single ball in the bag). – Sextus Empiricus Jan 23 '19 at 13:19
  • The distribution cannot possibly be chi-squared because it is discrete and bounded. Possibly, when $n$ is large, a *chi* distribution would be a decent approximation, depending on your purpose. – whuber Jan 23 '19 at 13:48
  • @whuber: you're right, sorry. I meant _chi_ distribution, not chi-squared. Don't know why I wrote that... – Likk Jan 23 '19 at 15:10
  • @MartijnWeterings: how many balls are in the bag is not specified completely, that's right. The distribution of the numbers is given in this sense: the probability of picking a ball with number $x$ is distributed like $\text{Binom}(n, p)$ (for some $p$). I hope this clarifies my question! – Likk Jan 23 '19 at 15:13
  • There is no such thing as a chi distribution with zero degrees of freedom, though. Yours is (very approximately) $\sqrt{2p(1-p)n}$ times a chi distribution with one df. The approximation may be poor near zero unless $p(1-p)n$ is large. – whuber Jan 23 '19 at 15:15
  • You can divide your problem in two parts. You have a small probability $\frac{1}{n_{balls}}$ that $y$ is the same ball as $x$ in which case the difference is zero. You have another probability that $x$ and $y$ are not the same balls in which case the difference is according to the distribution of two similar and independent binomial distributed variables..... – Sextus Empiricus Jan 23 '19 at 15:18
  • ...... this latter one, the difference of two binomially distributed variables, is not easy to express. You could see it as the sum of $n$ copies of a categorical variable which has: $$p(x) = \begin{cases} p(1-p) \quad \text{if $x=-1$} \\ 1-2p(1-p) \quad \text{if $x=0$} \\ p(1-p) \quad \text{if $x=1$} \\\end{cases}$$ This is also related to the sum of dice rolls. – Sextus Empiricus Jan 23 '19 at 15:20
  • @MartijnWeterings The difference will typically be zero with much greater probability than that, because to achieve a binomial distribution there will be many pairs of *different* balls that have the same numeric label. For $p$ near $1/2,$ $n$ has to be quite large before the chance of a zero difference becomes appreciably small. – whuber Jan 23 '19 at 15:20
  • @whuber, you are right, but I was seeing it as a mixture distribution where the one distribution (that occurs with probability 1/n) is *always* zero, and the other distribution (that occurs with probability 1-1/n) is sometimes zero. So indeed the probability of zero is larger than 1/n. The main point of my comment is that you can divide the problem into a simple mixture distribution and then the (difficult) core problem is to solve that difference of two binomial distributed variables. – Sextus Empiricus Jan 23 '19 at 15:22
  • @MartijnWeterings Thank you for the explanation. As far as the core problem goes, use the probability generating function, which has a simple form. (*How* you use it will depend on how you prefer to express the distribution.) – whuber Jan 23 '19 at 15:29
  • The complexity that I am thinking about is whether one can simplify the expressions resulting from it (these expressions should be some summation of some sort). – Sextus Empiricus Jan 23 '19 at 15:30
  • I'd also wonder how it is known, if some numbers can be repeated, that the numbers on the balls are between 1 and $n$. Is it because all the balls have been examined and 1 is the smallest numbered ball in the bag and $n$ is the largest numbered ball known to be in the bag or is it just known that numbers between 1 and $n$ *could have* been assigned to balls? – StatsStudent Jan 23 '19 at 16:20

1 Answer


Difference of two independent binomial distributed variables with the same parameters

I have a big bag of balls, each one marked with a number between 0 and $n$. The same number may appear on more than one ball. We can assume that the numbers on the balls follow a binomial distribution.

Now I pick a random ball from the bag, read its number $x$ and put the ball back. Then I pick a second random ball from the bag, read its number $y$ and put it back.

The core of this question is answered by the difference of two independent binomially distributed variables with the same parameters $n$ and $p$. Let's phrase this as:

Let $X \sim Bin(n,p)$ and $Y \sim Bin(n,p)$ be independent, and let $Z = Y-X$. What is the frequency distribution of $\vert Z \vert$?
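
As a quick sanity check, the exact distribution of $\vert Z \vert$ can be tabulated by brute force from the joint distribution of $X$ and $Y$; a minimal sketch, where n = 30 and p = 0.6 are arbitrary example values:

# brute-force tabulation of P(|Z| = k) from the joint pmf of X and Y
n <- 30; p <- 0.6
joint <- outer(dbinom(0:n, n, p), dbinom(0:n, n, p))   # P(X = x, Y = y)
diffs <- abs(outer(0:n, 0:n, "-"))                     # |x - y| for each cell
pz <- sapply(0:n, function(k) sum(joint[diffs == k]))  # P(|Z| = k)
round(pz[1:5], 4)                                      # P(|Z| = 0), ..., P(|Z| = 4)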

The more general situation has been handled on Mathematics Stack Exchange, as mentioned in the comments.

You can solve the difference in two ways.

  • Approximation with a normal distribution that has the same mean and variance. You have $\mu_X=\mu_Y = np$ and $\sigma_X^2 = \sigma_Y^2 = np(1-p)$, and correspondingly $\mu_Z = 0$ and $\sigma_Z^2 = 2np(1-p)$, so you can approximate $Z \dot\sim N(0,2np(1-p))$ and, for $\vert Z \vert$, integrate that normal density: $$P(\vert Z \vert = k) \approx \begin{cases} \frac{1}{\sigma_Z}\phi(0) & \quad \text{if $k=0$} \\ \frac{2}{\sigma_Z}\phi\left(\frac{k}{\sigma_Z}\right) & \quad \text{if $k\geq1$} \end{cases}$$

    which is close to a half-normal distribution (or chi distribution, as you call it), except that the point $k=0$ does not carry the factor 2.

  • Compute a sum or convolution over all possible values of $X$ and $Y$ that lead to a given $Z$. The probability mass functions of $X$ and $Y$ are:

    $$f_X(x) = {{n}\choose{x}} p^{x}(1-p)^{n-x}$$ $$f_Y(y) = {{n}\choose{y}} p^{y}(1-p)^{n-y}$$

    The probability for $Z=z \geq 0$ is

    $$f_Z(z) = \sum_{k=0}^{n-z} f_X(k) f_Y(z+k)$$

    and correspondingly

    $$P(\vert Z \vert = k) = \begin{cases} f_Z(k) & \quad \text{if $k=0$} \\ 2 f_Z(k) & \quad \text{if $k\geq1$} \end{cases}$$

  • The sum can also be expressed with a generalized hypergeometric function. The function $f_Z(z)$ can be written as:

    $$f_Z(z) = \sum_{k=0}^{n-z} \frac{(n!)^2 \, p^{2k+z} (1-p)^{2n-2k-z}}{k!\,(k+z)!\,(n-k)!\,(n-k-z)!} $$

    or as a generalized hypergeometric series

    $$f_Z(z) = \sum_{k=0}^{n-z} { \beta_k \left(\frac{p^2}{(1-p)^2}\right)^{k}} $$

    with $$ \beta_0 = {{n}\choose{z}}{p^z(1-p)^{2n-z}}$$

    and $$\frac{\beta_{k+1}}{\beta_k} = \frac{(-n+k)(-n+z+k)}{(k+1)(k+z+1)}$$

    such that we can write $f_Z(z)$ in terms of a hypergeometric function:

    $$f_Z(z) = {{n}\choose{z}}{p^z(1-p)^{2n-z}} \, {}_2F_1\left(-n,-n+z;z+1;p^2/(1-p)^2\right)$$

    if $p=0.5$ (i.e. $p^2/(1-p)^2=1$), the function simplifies to

    $$f_Z(z) = {{2n}\choose{z+n}}p^{2n}$$

    and we could say if $p=0.5$ then $Z+n \sim Bin(2n,0.5)$.

    This result for $p=0.5$ could also be derived more directly by $$f_Z(z) = 0.5^{2n} \sum_{k=0}^{n-z} {{n}\choose{k}}{{n}\choose{z+k}} = 0.5^{2n} \sum_{k=0}^{n-z} {{n}\choose{k}}{{n}\choose{n-z-k}} = 0.5^{2n} {{2n}\choose{n-z}}$$ using Vandermonde's identity. (Both the convolution sum and the hypergeometric expression are checked numerically in the sketch after this list.)
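
The convolution sum, the hypergeometric expression, and the $p=0.5$ special case can be checked against each other numerically. A minimal sketch, assuming the hypergeo package that is also used in the computations below:

library(hypergeo)

# direct convolution: f_Z(z) = sum_k P(X = k) * P(Y = z + k)
fZ_conv <- function(z, n, p) {
  k <- 0:(n - z)
  sum(dbinom(k, n, p) * dbinom(z + k, n, p))
}

n <- 30; p <- 0.6; z <- 3
c(convolution = fZ_conv(z, n, p),
  hypergeometric = choose(n, z) * p^z * (1 - p)^(2*n - z) *
    Re(hypergeo(-n, -n + z, z + 1, p^2/(1 - p)^2)))

# p = 0.5 special case: f_Z(z) should equal choose(2n, z + n) * 0.5^(2n)
c(convolution = fZ_conv(z, n, 0.5),
  vandermonde = choose(2*n, z + n) * 0.5^(2*n))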

Computational example

Below is an example of the above results compared with a simulation. The small difference shows that the normal approximation does very well.

[Figure: histogram of the simulated $\vert Z \vert$ compared with the computed exact and normal-approximation frequencies]

library(hypergeo)

n <- 30
z <- 0:n
p <- 0.6

# simulate
set.seed(1)
ns <- 100000
X <- rbinom(ns,n,p)
Y <- rbinom(ns,n,p)
Z <- abs(X-Y)

# compute 1: exact probabilities via the hypergeometric expression
beta0 <- choose(n,z)*p^z*(1-p)^(2*n-z)
ps1 <- beta0*Re(hypergeo(-n,-n+z,z+1,p^2/(1-p)^2))

# compute 2 normal approximation
ps2 <- dnorm(z,0,sqrt(2*n*p*(1-p)))

# plot: the factor c(1, 2, 2, ...) folds Z onto |Z|
hist(Z, breaks = c(z,n+1)-0.5, freq=FALSE, main = "Histogram of simulation compared with computed frequencies \n Bin(30,0.6)")
points(z, ps1*c(1,rep(2,n)), pch=21, col="black", bg="white", cex=1)   # exact
points(z, ps2*c(1,rep(2,n)), pch=3, col="black", cex=1)                # normal approximation

legend(15,0.20,c("computed exact probability","computed normal approximation"),
       pch=c(21,3),cex=c(1,1))

Mixture distribution

I bought some balls, all blank. I take a binomial random number generator, configure it with some $n$ and $p$, and for each ball I paint the number that I get from the display of the generator. Then I put the balls in a bag and start the process that I described.

If the numbers on the balls are considered random variables (each following a binomial distribution), then the frequency distribution for the difference $X-Y$ is a mixture distribution in which the number of balls in the bag, $m$, plays a role.

You have two situations:

  1. The first and second ball that you take from the bag are the same. This situation occurs with probability $\frac{1}{m}$. In this case the difference $\vert x-y \vert$ is equal to zero.

  2. The first and second ball are not the same. This situation occurs with probability $1-\frac{1}{m}$. In this case the difference $\vert x-y \vert$ is distributed as the difference of two independent, identically distributed binomial variables.

The above situation could also be considered a compound distribution: you have a parameterized distribution for the difference of two draws from a bag with balls numbered $x_1, \ldots, x_m$, and the parameters $x_i$ are themselves distributed according to a binomial distribution.
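
A minimal sketch of the exact mixture pmf along these lines (pmf_mixture is an illustrative helper name, not from the code below; $m$ is the number of balls):

# exact mixture pmf: point mass 1/m at zero (same ball drawn twice)
# plus (1 - 1/m) times the pmf of |Y - X| for independent Bin(n, p) draws
pmf_mixture <- function(k, n, p, m) {
  fz <- sapply(k, function(z) {
    j <- 0:(n - z)
    sum(dbinom(j, n, p) * dbinom(z + j, n, p))   # f_Z(z) by convolution
  })
  pabs <- ifelse(k == 0, fz, 2 * fz)             # fold Z onto |Z|
  (1/m) * (k == 0) + (1 - 1/m) * pabs
}
round(pmf_mixture(0:4, n = 30, p = 0.6, m = 5), 4)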

Computational example

Below is an example of the result when 5 balls $x_1,x_2,x_3,x_4,x_5$ are placed in a bag and the balls carry random numbers $x_i \sim Bin(30,0.6)$. The probability distribution for the difference of two balls taken out of such a bag is computed by simulating 100,000 of those bags. (Note: this is not the probability distribution of the outcome for one particular bag, which has at most 11 distinct outcomes.)

[Figure: histogram of the simulated mixture compared with the computed exact and normal-approximation frequencies]

library(hypergeo)

n <- 30
z <- 0:n
p <- 0.6
nb <- 5   # number of balls in the bag (m in the text)

# simulate   (make ns bags, and sample from them)
set.seed(1)
ns <- 100000
bags <- matrix(rbinom(ns*nb,n,p),ns)
X <- apply(bags,1, function(x) sample(x,1))
Y <- apply(bags,1, function(x) sample(x,1))

Z <- abs(X-Y)

# compute 1: exact probabilities via the hypergeometric expression
beta0 <- choose(n,z)*p^z*(1-p)^(2*n-z)
ps1 <- beta0*Re(hypergeo(-n,-n+z,z+1,p^2/(1-p)^2))

# compute 2 normal approximation
ps2 <- dnorm(z,0,sqrt(2*n*p*(1-p)))

# plot: mixture of a point mass 1/nb at zero and (1-1/nb) times the |Z| distribution
hist(Z, breaks = c(z,n+1)-0.5, freq=FALSE, main = "Histogram of simulation compared with computed frequencies \n 5 balls in the bag with numbers sampled from Bin(30,0.6)")
points(z, (1-1/nb)*ps1*c(1,rep(2,n)) + c(1/nb,rep(0,n)), pch=21, col="black", bg="white", cex=1)   # exact
points(z, (1-1/nb)*ps2*c(1,rep(2,n)) + c(1/nb,rep(0,n)), pch=3, col="black", cex=1)                # normal approximation

legend(15,0.20,c("computed exact probability","computed normal approximation"),
       pch=c(21,3),cex=c(1,1))
– Sextus Empiricus
  • Since the balls follow a binomial distribution, why would the number of balls in a bag ($m$) matter? Nothing should depend on this, nor should it be useful in finding an answer. Your example in assumption (2) appears to contradict the assumed binomial distribution. I wonder whether you are interpreting "binomial distribution" in some unusual way? – whuber Jan 23 '19 at 16:22
  • @whuber, consider the case when the bag contains only 1 ball (which is assigned randomly a number according to the binomial distribution). Then $x$ and $y$ will be the same value (even though the balls inside the bag have been assigned independently random numbers, that does not mean that the balls that we draw from the bag are independent, this is because we have a possibility of drawing the same ball twice) – Sextus Empiricus Jan 23 '19 at 16:54
  • So, say I wish to experimentally derive the distribution by simulating a number $N$ times drawing $x$ and $y$, then my interpretation is to simulate $N$ *different* bags with each $m$ balls in them and the balls have numbers on them that are independently drawn from a binomial distribution. – Sextus Empiricus Jan 23 '19 at 17:32
  • Although the question is somewhat unclear (the values of a Binomial$(n)$ distribution range from $0$ to $n,$ not $1$ to $n$), it is difficult to see how your interpretation matches the statement "We can assume that the numbers on the balls follow a binomial distribution." That's a very specific description of the frequencies of these $n+1$ numbers and it does not depend on random sampling or simulation. – whuber Jan 23 '19 at 17:44
  • Because the bag is a model for a population. You can understand it either as being a close approximation--with the approximation made as close as one might wish simply by increasing the numbers of balls--or as containing a Nonstandard infinite number of balls. Regardless, "following a distribution" and "drawn from a distribution" are such distinct phrases and concepts that we should maintain the distinction and take the OP at his word. – whuber Jan 23 '19 at 19:24
  • I think it's neither common nor ambiguous, but I'm sure our experiences vary. The problem statement is explicit: there's a bag of balls and they have numerical labels as described; no sampling was involved or implied. I don't understand what you mean by the "other half" of the question. BTW, an exact answer is not hard to obtain (if you like Hypergeometric functions). Just about any demonstration of the approximation will be tantamount to a proof of a special case of the CLT. – whuber Jan 23 '19 at 19:43
  • Let's put it this way: I bought some balls, all blank. I take a binomial random number generator, configure it with some $n$ and $p$, and for each ball I paint the number that I get from the display of the generator. Then I put the balls in a bag and start the process that I described. Does it make sense now? – Likk Jan 23 '19 at 19:44
  • In that case the number of balls in the bag and the use of the mixture distribution *does* matter. – Sextus Empiricus Jan 23 '19 at 19:46
  • Likk, that's still a little vague. To illustrate the difference, suppose $n=2$ and $p=1/2.$ Your question *appears* to assert that a quarter of the balls are labeled with $0,$ half with $1,$ and a quarter with $2.$ If you use a random number generator, then all we know for sure is the values are among $0,1,2,$ but the exact proportions are up to chance. In this latter case your question has no definite answer (because of that uncertainty) whereas in the former case it has a unique correct answer. Which question are you trying to ask? – whuber Jan 23 '19 at 19:52
  • @whuber: of course reality is up to chance, just like, for example, if we toss a coin 100 times, it's possible to obtain 100 heads. Here I'm not interested in a specific instance of the problem, but in the more "probable" case, which is the case that follows closely the model. If you assume that with $n=2$ and $p=1/2$ a quarter of the balls is 0, half is 1, and a quarter is 2, then that's a perfectly valid assumption! That's _exactly_ the assumption that I'm making – Likk Jan 23 '19 at 19:58
  • Thank you for that search Martijn, but I don't find it relevant because in the current setting *no sampling is involved* in creating the box. The OP is simply describing a distribution; they are not describing a sample from a distribution. – whuber Jan 23 '19 at 20:04
  • I'm going to exit this dialog, Martijn, because we're making no progress. There's no reference to sampling in the question and the OP has kindly, and quite clearly, clarified its meaning. – whuber Jan 23 '19 at 21:18
  • Whatever the interpretation of the question, the answer should be sufficient to help out. I have added two simulations to clearly show two different interpretations that have been used in the answer. – Sextus Empiricus Apr 06 '20 at 08:27