
Short version

I am trying to analytically solve/approximate the composite likelihood that results from independent Poisson draws followed by further sampling, with or without replacement (I don't really care which). I want to use the likelihood with MCMC (Stan), so I need the solution only up to a constant term. Ultimately, I want to model a process where the initial draws are from a negative binomial distribution, but I think I will be able to get there from a solution for the Poisson case.

It is quite possible that a solution is not feasible (I don't understand the math well enough to tell whether this is a simple or a very difficult problem). I am therefore also interested in approximations, negative results, or intuition for why the problem is probably intractable (e.g. a comparison to a known hard problem). Links to useful papers/theorems/tricks that would help me move forward are good answers even if their connection to the problem at hand is not fully worked out.

Formal statement

More formally, first $Y = (y_1, ..., y_N)$ with $y_n \sim \text{Pois}(\lambda_n)$ is drawn independently, and then I sample $k$ items at random from all of $Y$ to get $Z = (z_1,...,z_N)$. I.e., I draw $k$ coloured balls from an urn where the number of balls of colour $n$ is drawn from $\text{Pois}(\lambda_n)$. Here, $k$ is assumed known and fixed, and we condition on $\sum_n y_n \geq k$. Technically the sampling is done without replacement, but assuming sampling with replacement should not be a big deal.
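For concreteness, here is a minimal R sketch of the generative process just described (the values $N = 3$, $k = 5$ and the rates are arbitrary choices for illustration):

# minimal sketch of the generative process (illustrative values only)
lambda <- c(2, 3, 4)                  # one rate per colour, N = 3
k <- 5
repeat {                              # condition on sum(Y) >= k by rejection
  Y <- rpois(length(lambda), lambda)  # independent Poisson counts per colour
  if (sum(Y) >= k) break
}
urn <- rep(seq_along(lambda), Y)      # Y[n] balls of colour n
Z <- tabulate(sample(urn, k), nbins = length(lambda))  # k draws without replacement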

I have tried two approaches to solve for sampling without replacement (as this seemed to be the easier case due to some terms cancelling out), but got stuck with both. The likelihood when sampling without replacement is:

$$ P(Z = (z_1, ..., z_N) | \Lambda = (\lambda_1, ..., \lambda_N)) = \frac{ \sum_{Y;\forall n: y_n \geq z_n} \left( \frac{\prod_{n=1}^N{y_n \choose z_n}}{ {\sum_{n=1}^N y_n} \choose k} \prod_{n=1}^N Poisson(y_n |\lambda_n) \right) }{ P(\sum_n y_n \geq k|\Lambda) } $$

EDIT: The "attempted solutions" section was removed, as the solution in the answer does not build on it (and is way better).

Martin Modrák

1 Answer


The case without replacement

If you have $n$ independent Poisson distributed variables

$$Y_i \sim \text{Pois}(\lambda_i)$$

and condition on

$$\sum_{j=1}^n Y_j = K$$

then

$$\lbrace Y_i \rbrace \sim \text{Multinom} \left(K,\left(\frac{\lambda_i}{\sum_{j=1}^n \lambda_j} \right)\right)$$
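To see why, write $\Lambda = \sum_{j=1}^n \lambda_j$; for any $y_1, \dots, y_n$ with $\sum_i y_i = K$,

$$P\left(Y_1 = y_1, \dots, Y_n = y_n \,\middle|\, \sum_{j=1}^n Y_j = K\right) = \frac{\prod_{i=1}^n e^{-\lambda_i} \lambda_i^{y_i} / y_i!}{e^{-\Lambda} \Lambda^K / K!} = \frac{K!}{\prod_{i=1}^n y_i!} \prod_{i=1}^n \left(\frac{\lambda_i}{\Lambda}\right)^{y_i},$$

which is exactly the multinomial pmf.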


So you can view filling the urn with the balls of the $n$ colours as a two-step process: first draw the total $K$ (which is Poisson distributed, truncated by the condition $K \geq k$), and then fill the urn with $K$ balls according to the multinomial distribution above.

This filling of the urn with $K$ balls according to a multinomial distribution is equivalent to drawing the colour of each ball independently from the categorical distribution. You can then consider the first $k$ balls added to the urn as the random sample $\lbrace Z_i \rbrace$ (when this sample is drawn without replacement), so the distribution of the sample is just another multinomially distributed vector:

$$\lbrace Z_i \rbrace \sim \text{Multinom} \left(k,\left(\frac{\lambda_i}{\sum_{j=1}^n \lambda_j} \right)\right)$$
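In particular, since the likelihood is only needed up to a constant term (for use in Stan), this multinomial result gives the target log density directly:

$$\log P(Z \mid \Lambda) = \sum_{i=1}^n z_i \log \left( \frac{\lambda_i}{\sum_{j=1}^n \lambda_j} \right) + \text{const.}$$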

simulation

##### simulating the sampling process for 3 variables #####

# settings
set.seed(1)
k = 10
lambda = c(4, 4, 4)
trials = 1000000

# table of observed counts for each outcome (z1, z2, z3)
Ocounts = array(0, dim = c(k+1, k+1, k+1))

for (i in 1:trials) {
  # draw Poisson counts, rejecting until sum(Y) >= k
  repeat {
    Y = rpois(3, lambda)
    if (sum(Y) >= k) {break}
  }
  # set up urn with Y[n] balls of colour n
  urn <- c(rep(1, Y[1]), rep(2, Y[2]), rep(3, Y[3]))
  # draw k balls from the urn without replacement
  s <- sample(urn, k, replace = FALSE)
  Z = c(sum(s==1), sum(s==2), sum(s==3))
  Ocounts[Z[1]+1, Z[2]+1, Z[3]+1] = Ocounts[Z[1]+1, Z[2]+1, Z[3]+1] + 1
}

# comparison with the multinomial pmf
# number of outcomes with z1+z2+z3 = k is (k+1)*(k+2)/2, i.e. 66 for k = 10
ncells = (k+1)*(k+2)/2
observed = rep(0, ncells)
expected = rep(0, ncells)
t = 1

for (z1 in 0:k) {
  for (z2 in 0:(k-z1)) {
    z3 = k-z1-z2
    observed[t] = Ocounts[z1+1, z2+1, z3+1]
    expected[t] = trials*dmultinom(c(z1, z2, z3), prob=lambda)
    t = t+1
  }
}

plot(observed, expected)
x2 <- sum((observed-expected)^2/expected)
pvalue <- 1 - pchisq(x2, ncells-1)

results

> # results from chi-sq test
> x2
[1] 75.49286
> pvalue
[1] 0.1754805

Negative binomial

The same argument works for the case of a negative binomial distribution, which (under certain conditions) results in a Dirichlet-multinomial distribution.
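To make the "certain conditions" explicit (this is the point discussed in the comments below): if the $Y_i \sim \text{NB}(r_i, p)$ are independent with a *common* $p$, then conditional on the total,

$$\lbrace Y_i \rbrace \mid \sum_{j=1}^n Y_j = K \;\sim\; \text{DirMultinom}\left(K, (r_1, \dots, r_n)\right),$$

and, as above, subsampling $k$ of the $K$ balls without replacement again gives a Dirichlet-multinomial, now with total $k$ and the same parameters $r_i$. The simulation below parameterizes the negative binomials so that the means are $\lambda_i$ and the variances $\lambda_i(1+\theta)$, which gives $r_i = \lambda_i/\theta$ with a common $p$; hence the Dirichlet-multinomial parameters $\lambda_i/\theta$ in the comparison.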

Below is an example simulation

##### simulating the sampling process for 3 variables #####

# Dirichlet-multinomial pmf for vectors of length 3
# (p1, p2, p3 are the Dirichlet parameters)
ddirmultinom = function(x1, x2, x3, p1, p2, p3) {
  (factorial(x1+x2+x3)*gamma(p1+p2+p3)/gamma(x1+x2+x3+p1+p2+p3))/
  (factorial(x1)*gamma(p1)/gamma(x1+p1))/
  (factorial(x2)*gamma(p2)/gamma(x2+p2))/
  (factorial(x3)*gamma(p3)/gamma(x3+p3))
}

# settings
set.seed(1)
k = 10
theta = 1
lambda = c(4, 4, 4)
trials = 1000000

# negative binomial parameters from the means and variances
# (R's rnbinom(n, size, prob) has mean size*(1-prob)/prob, so prob = mean/var)
means = lambda
vars = lambda*(1+theta)

ps = means/vars
rs = means^2/(vars-means)

# table of observed counts for each outcome (z1, z2, z3)
Ocounts = array(0, dim = c(k+1, k+1, k+1))

for (i in 1:trials) {
  # draw negative binomial counts, rejecting until sum(Y) >= k
  repeat {
    Y = rnbinom(3, rs, ps)
    if (sum(Y) >= k) {break}
  }
  # set up urn with Y[n] balls of colour n
  urn <- c(rep(1, Y[1]), rep(2, Y[2]), rep(3, Y[3]))
  # draw k balls from the urn without replacement
  s <- sample(urn, k, replace = FALSE)
  Z = c(sum(s==1), sum(s==2), sum(s==3))
  Ocounts[Z[1]+1, Z[2]+1, Z[3]+1] = Ocounts[Z[1]+1, Z[2]+1, Z[3]+1] + 1
}

# comparison with the Dirichlet-multinomial pmf (parameters lambda/theta)
ncells = (k+1)*(k+2)/2
observed = rep(0, ncells)
expected = rep(0, ncells)
t = 1

for (z1 in 0:k) {
  for (z2 in 0:(k-z1)) {
    z3 = k-z1-z2
    observed[t] = Ocounts[z1+1, z2+1, z3+1]
    expected[t] = trials*ddirmultinom(z1, z2, z3, lambda[1]/theta, lambda[2]/theta, lambda[3]/theta)
    t = t+1
  }
}

plot(observed, expected)
x2 <- sum((observed-expected)^2/expected)
pvalue <- 1 - pchisq(x2, ncells-1)

# results from chi-sq test
x2
pvalue
Sextus Empiricus
  • Thanks for the answer. I've tried this approach already and I have two issues: a) I don't see why conditioning on the sum is the same as resampling (mathematically), and b) my (admittedly limited) simulated data from Poisson + resampling could not be fit well with a multinomial distribution. Could you elaborate more on why conditioning and resampling would be the same? – Martin Modrák Aug 03 '18 at 12:07
  • Oh, on further reading, I think I see your point. Maybe I just made a stupid mistake with the simulations, will go check. – Martin Modrák Aug 03 '18 at 12:14
  • I use two steps. **1)** Filling the urn with $Y_i$ coloured balls by drawing the $Y_i$ from independent Poisson distributions is equivalent to filling the urn by drawing a total $\sum Y_i = K$ from the Poisson distribution and then the $Y_i$ from a multinomial distribution. **2)** Drawing a subset of balls from an urn filled with coloured balls whose numbers are defined by a multinomial distribution is equivalent to another multinomial distribution (intuition: you can see the urn as being filled by drawing the colour of each ball from the categorical distribution). – Sextus Empiricus Aug 03 '18 at 13:29
  • Yes, there was a bug in my simulation code :-) Thanks for the help! You are correct and I also understand the step in your reasoning that I didn't see previously. I am trying to understand why the same relation does not hold for neg. binomial and Dirichlet multinomial distribution. (assuming the NB variables have constant Fano factors, conditioning on their sum I get DM) - because "sampling without replacement from multinomial" is multinomial, but "sampling without replacement from DM" is not DM (is that reasoning correct in your view?) – Martin Modrák Aug 03 '18 at 14:24
  • I would guess that the *separate* steps would work but not together. So 1) the conditioned neg.binomial results in DM (according to your https://stats.stackexchange.com/questions/45318/conditional-on-the-total-what-is-the-distribution-of-negative-binomials/354433#354433). and 2) subsampling from the DM based urn is also a DM. ....... But, would you get the same parameters in the DM for different values of K? – Sextus Empiricus Aug 03 '18 at 14:32
  • You actually do get the same parameters of the DM for different $K$. But the problem might be that in the multinomial pmf, $K$ only affects the normalization constant, while in the DM $K$ interacts with the other parameters. – Martin Modrák Aug 03 '18 at 14:48
  • I will do another simulation to test my intuition. I don't see why the same arguments as in the answer would not work the same for the negative binomial and Dirichlet-multinomial. Did your experimental simulations use 'without replacement'? – Sextus Empiricus Aug 03 '18 at 14:56
  • Turns out I also messed up the second simulation. It seems to work for the DM case as well. – Martin Modrák Aug 03 '18 at 15:12
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/81114/discussion-between-martin-modrak-and-martijn-weterings). – Martin Modrák Aug 03 '18 at 15:18