I am trying to model a discrete data generating process where I first draw $Y = (y_1, \dots, y_N)$ with $y_n \sim F(\theta_n)$ independently from some family of discrete distributions $F$ (e.g. negative binomial), and then sample $k$ items at random from all of $Y$ to get $Z = (z_1, \dots, z_N)$, where $z_n$ counts the sampled items of type $n$, so $\sum_n z_n = k$. That is, I draw $k$ coloured balls from an urn where the number of balls of colour $n$ is itself drawn from $F(\theta_n)$. Here, $k$ is known and $p(\sum_n y_n > k) \simeq 1$. Technically the sampling is done without replacement, but assuming sampling with replacement should not be a big deal.
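For concreteness, here is a minimal simulation sketch of this generating process (all parameter values are hypothetical; note that numpy's negative binomial counts failures before $r$ successes with success probability $p$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: N colours, sample size k, shared NB probability p.
N, k, p = 5, 20, 0.3
r = rng.uniform(2.0, 8.0, size=N)  # per-colour NB size parameters r_n

def draw_z(rng):
    y = rng.negative_binomial(r, p)           # urn contents, one count per colour
    assert y.sum() >= k                       # relies on p(sum_n y_n > k) ~ 1
    urn = np.repeat(np.arange(N), y)          # one entry per ball, labelled by colour
    sampled = rng.choice(urn, size=k, replace=False)
    return np.bincount(sampled, minlength=N)  # z_n = sampled balls of colour n

print(draw_z(rng))
```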

I have had some limited success approximating this by conditioning on the sum of $Y$, i.e. taking $p(Z = \tilde{Z}) = p(Y = \tilde{Z}) \left( p(\sum_n y_n = k) \right)^{-1}$ for any $\tilde{Z}$ with $\sum_n \tilde{z}_n = k$. However, this approximation breaks down easily, and I am trying to understand why and what I can do to improve it. Are there any theoretical reasons to believe conditioning on the sum is similar to the subsampling process I described? My intuition is that both keep the relative proportions of the individual elements of $Z$ and introduce a negative correlation between the elements of $Z$, but perhaps the exact nature of this correlation differs a lot.
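One way to probe how the two mechanisms differ is to simulate both and compare their means and correlation matrices: subsampling as described above, versus brute-force rejection sampling of $Y$ given $\sum_n y_n = k$. A minimal sketch with small hypothetical parameters so the rejection step stays cheap:

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, p = 3, 10, 0.4
r = np.array([2.0, 4.0, 6.0])   # hypothetical NB size parameters

def subsample():                # draw Y, then sample k balls without replacement
    while True:
        y = rng.negative_binomial(r, p)
        if y.sum() >= k:
            urn = np.repeat(np.arange(N), y)
            s = rng.choice(urn, size=k, replace=False)
            return np.bincount(s, minlength=N)

def condition_on_sum():         # rejection sampler for Y | sum(Y) = k
    while True:
        y = rng.negative_binomial(r, p)
        if y.sum() == k:
            return y

A = np.array([subsample() for _ in range(5000)])
B = np.array([condition_on_sum() for _ in range(5000)])
print("means:", A.mean(axis=0), "vs", B.mean(axis=0))
print("corr, subsampling:\n", np.corrcoef(A.T))
print("corr, conditioning:\n", np.corrcoef(B.T))
```

If the two correlation matrices differ markedly here, that would already show where the approximation starts to break.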

I hope there are some general claims to be made, but I am particularly interested in the case where $F$ is the negative binomial distribution, as this allows me to (sometimes) use the Dirichlet-multinomial distribution when conditioning on the sum (as in this answer: "Conditional on the total, what is the distribution of negative binomials?").
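To illustrate that special case: if $y_n \sim \mathrm{NB}(r_n, p)$ with a shared $p$, then $Y \mid \sum_n y_n = k$ is Dirichlet-multinomial with $\alpha_n = r_n$ (the result from the linked answer). A quick sanity check of the identity, with hypothetical parameters and the log-pmf written out via scipy.special.gammaln rather than relying on a library implementation:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
r, p, k = np.array([2.0, 4.0, 6.0]), 0.4, 10   # hypothetical shared-p NB setup

def dm_logpmf(x, alpha):
    # Dirichlet-multinomial log-pmf for counts x with sum(x) = n, parameters alpha.
    n, a0 = x.sum(), alpha.sum()
    return (gammaln(n + 1) - gammaln(x + 1).sum()
            + gammaln(a0) - gammaln(a0 + n)
            + (gammaln(alpha + x) - gammaln(alpha)).sum())

# Empirical conditional frequency of one count vector vs. the DM pmf.
Y = rng.negative_binomial(r, p, size=(500_000, len(r)))
cond = Y[Y.sum(axis=1) == k]                   # draws of Y given sum(Y) = k
target = np.array([2, 3, 5])
emp = np.mean(np.all(cond == target, axis=1))
print("empirical:", emp, " DM pmf:", float(np.exp(dm_logpmf(target, r))))
```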
