
We have $N$ buckets in which we will put some balls. Before that, the buckets are split into two groups, group $A$ and group $B$. The number of balls that we will put in each bucket is drawn from a Binomial distribution. In group $A$, the parameters of this Binomial distribution are $n$ and $p_A$, while in group $B$, the parameters of the Binomial distribution are $n$ and $p_B$.

Given $p_A$, $p_B$ and $n$, what strength of association (number of balls ~ groups) should one expect to find? Given $N$, what is the 95% confidence interval?

If we can't get a solution analytically, I would welcome a piece of code that can do numerical estimations (I started below with a small and very slow R script). Numerical estimation has the advantage of providing the whole distribution, which would probably be very complicated to calculate analytically.

Numerical estimations with R

Here is a quick R script that plots the distribution of $R^2$ (the coefficient of determination) for chosen values of $n$, $p_A$, $p_B$ and $N$:

# Settings    
N = 200
pA = 10^(-6)
pB = 10^(-5)
n = 10^5
nbreplicates = 1000

# Simulations
groups = rep(c("A", "B"), N/2)
r.squares = c()

for (replicate in 1:nbreplicates){
    buckets = c()
    for (i in 1:N){
        if (i%%2 != 0){
            buckets = append(buckets, rbinom(1,n, pA))  
        } else{
            buckets = append(buckets, rbinom(1,n, pB))  
        }
    }
    r.squares = append(r.squares, summary(lm(buckets ~ groups))$r.squared)
}
hist(r.squares)
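A faster variant of the simulation above, given as a minimal sketch rather than a definitive implementation. It relies on two facts: `rbinom` is vectorised over `prob`, and with a single binary predictor the $R^2$ from `lm` equals the squared Pearson correlation. The helper names `simulate_r2` and `pop_r2` are hypothetical (they are not in the original post), and `pop_r2` assumes equal group sizes. The 2.5% and 97.5% quantiles describe the central 95% of the simulated sampling distribution, which may or may not be the interval the question intends.

```r
# Hypothetical helper (not from the original post): simulate the sampling
# distribution of R^2 for N buckets alternating between groups A and B.
simulate_r2 <- function(N, n, pA, pB, nbreplicates = 1000) {
  stopifnot(N %% 2 == 0)
  is_B  <- rep(c(0, 1), N / 2)          # same alternating A/B layout as above
  probs <- ifelse(is_B == 1, pB, pA)    # per-bucket success probability
  replicate(nbreplicates, {
    buckets <- rbinom(N, n, probs)      # one draw per bucket, vectorised
    # R^2 is undefined when every bucket holds the same count
    if (sd(buckets) == 0) NA_real_ else cor(buckets, is_B)^2
  })
}

# Population R^2 from the law of total variance.
# Assumption: half of the buckets are in each group.
pop_r2 <- function(n, pA, pB) {
  between <- n * (pA - pB)^2
  within  <- 2 * (pA * (1 - pA) + pB * (1 - pB))
  between / (between + within)
}

set.seed(1)
r2 <- simulate_r2(N = 200, n = 1e5, pA = 1e-6, pB = 1e-5)

mean(r2, na.rm = TRUE)                        # average simulated R^2
quantile(r2, c(0.025, 0.975), na.rm = TRUE)   # central 95% of the simulations
pop_r2(1e5, 1e-6, 1e-5)                       # population reference value
hist(r2)
```

Replicates in which all buckets hold the same count are returned as `NA` and dropped from the summaries, so it is worth checking `sum(is.na(r2))` when the expected counts are very small.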
Remi.b
  • In the special case $p_A = p_B$, there would be no association in the population. [The distribution of $R^2$ when the null hypothesis is true is comprehensively discussed in this thread](http://stats.stackexchange.com/q/130069/22228). It would be a Beta distribution. You are regressing on one variable (so $k=2$ when we include the intercept term), so the mode is 0; try setting $p_A$ equal to $p_B$ (and lower your $N$), then compare your histogram to [these theoretical graphs](http://i.stack.imgur.com/Q6uGx.png) - it should be similar to the left-hand $k=2$ column. – Silverfish Jan 16 '15 at 00:13
  • All this is very nice stuff. It works indeed. Thanks for this comment. I didn't expect to have a description of the whole distribution of $r^2$. Now my particular interest is when $p_A \neq p_B$. – Remi.b Jan 16 '15 at 00:32
  • That's where it all gets harder :) If you don't expect to be able to find the whole distribution, does that mean you are happy to find the answer purely computationally? Perhaps you should make this clear in your question. – Silverfish Jan 16 '15 at 00:36
  • I am hoping to get an analytic solution. But if this is too hard, I would definitely welcome a piece of code that can do numerical estimations. Numerical estimation has the advantage of providing the whole distribution, which would probably be very complicated to obtain analytically. I edited my post to clarify this point. Thanks – Remi.b Jan 16 '15 at 00:46
  • In the case where the probabilities are different, note that you'll have heteroscedasticity, which makes things even more complicated. – Silverfish Jan 16 '15 at 23:41
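
For the case $p_A \neq p_B$ raised in the comments, the population value that the sample $R^2$ estimates can be sketched with the law of total variance. This assumes half of the buckets fall in each group (as in the R code), writes $Y$ for the number of balls in a bucket, and uses $\rho^2$ (a symbol not in the original post) for the population $R^2$. It gives only the target value; the spread of the sample $R^2$ around it, and hence any interval for a given $N$, is a separate question.

$$\operatorname{Var}(Y) = \underbrace{\frac{n}{2}\left[p_A(1-p_A) + p_B(1-p_B)\right]}_{\text{within groups}} + \underbrace{\frac{n^2}{4}(p_A - p_B)^2}_{\text{between groups}}$$

$$\rho^2 = \frac{\text{between}}{\text{within} + \text{between}} = \frac{n\,(p_A - p_B)^2}{n\,(p_A - p_B)^2 + 2\left[p_A(1-p_A) + p_B(1-p_B)\right]}$$

With the settings from the question ($n = 10^5$, $p_A = 10^{-6}$, $p_B = 10^{-5}$) this evaluates to roughly $0.27$, and it reduces to $0$ when $p_A = p_B$.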

0 Answers