
Basically I have two (or more) different success-probability-generating distributions. In other words, there are two (or more) different (non-normal) distributions, each of which generates the success probabilities for one group of binomial distributions.

Empirically I have the number of successes and attempts for each binomial distribution, and I also know which binomial distribution belongs to which group. Of course, if I knew the true success probabilities for all binomial distributions, I could simply run a Kruskal–Wallis test or something similar and test for a significant difference between the two distributions. However, since the data I have are realizations of the binomial distributions, I can only estimate the success probability of each.

Is there an elegant way to run a significance test on these two non-normal distributions?

kjetil b halvorsen
user217357
    This question isn't totally clear to me, so to clarify, I'll try to restate: You have $n_i$ and $x_i$ (# of trials, successes) for each of several binomial distributions, which differ in their probability parameters $p_i$. The distributions are in several known groups $j=1,\dots, J$; each group has its $p_i$ drawn from a group-specific, unknown, probability distribution $f_j$. You are attempting to test whether the $f_j$ are different, given knowledge of the $(n_i, x_i)$ pairs and the group memberships. Does this seem right? – jbowman Sep 04 '18 at 19:40
    Yes, you are totally correct. I was thinking about introducing some general, formal notations here and you already did it just right. Thanks! – user217357 Sep 04 '18 at 20:32

1 Answer


Reformulation based on the comment by @jbowman: There are $J$ groups indexed by $j=1, \dotsc, J$. For each group we have some independent binomial observations, $x_{ji} \sim \mathcal{Binom}(n_{ji}, p_{ji})$ conditional on the probability parameter $p_{ji}$, which are in turn sampled (independently) from some distribution $f(p; \theta_j)$ on the interval $(0,1)$. The binomial probabilities $p_{ji}$ are not observed, so they can be seen as latent variables. The groups then differ in the distribution of binomial probabilities within the group. It is not clear to me whether the $n_{ji}$ are given constants or also somehow sampled. Neither is it clear whether $f$ is to be modeled parametrically or nonparametrically.

But I would start out with some parametric model for $f$, just to have a somewhat tractable problem. So, for a start, let us assume that $f(p; \theta_j)$ is a beta distribution; that will at least give a point of comparison for a more general solution. Let us also assume, for a start, that the $n_{ji}=n_j$ are constant within groups. First, observe that if $n_j=1$ we are lost, since then the resulting beta-binomial distribution collapses to a Bernoulli distribution, and the likelihood function is constant on the lines in the $(\alpha,\beta)$ parameter space where $\frac\alpha{\alpha+\beta}$ is constant. So assume that $n_j\ge 2$.
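The identifiability problem at $n_j=1$ is easy to check numerically. Below is a minimal sketch (not from the answer; the helper name `beta_binom_logpmf` is mine) of the beta-binomial log-pmf, i.e. the binomial with $p$ integrated out against a $\mathrm{Beta}(\alpha,\beta)$ density. For $n=1$ the pmf reduces to $P(x=1)=\frac\alpha{\alpha+\beta}$, so parameter pairs on the same ray through the origin give identical likelihoods:

```python
import numpy as np
from scipy.special import betaln, gammaln

def beta_binom_logpmf(x, n, a, b):
    """Log-pmf of Beta-Binomial(n, a, b): a binomial with the
    success probability integrated out against Beta(a, b)."""
    return (gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
            + betaln(x + a, n - x + b) - betaln(a, b))

# For n = 1 the pmf depends on (a, b) only through a / (a + b):
# (a, b) = (2, 6) and (1, 3) both give P(x = 1) = 0.25.
p1 = np.exp(beta_binom_logpmf(1, 1, 2.0, 6.0))
p2 = np.exp(beta_binom_logpmf(1, 1, 1.0, 3.0))
```

Both evaluations return $0.25$, confirming that with $n_j=1$ the data cannot distinguish $(\alpha,\beta)$ beyond their ratio.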

Since all the $J$ groups can be treated equally, just focus on one of them, and drop the index $_j$. With $k$ observations in the group, the situation now reduces to $k$ iid observations from the distribution $\mathcal{beta{-}Binom}(n,\alpha,\beta)$, and can be solved with maximum likelihood. Even if the binomial $n$ varies within the group, the same solution applies.
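As a sketch of that maximum-likelihood step (my own illustration, not code from the answer: the simulated group, the true parameter values, and the choice of Nelder-Mead are all assumptions), one can simulate a single group with varying $n_i$ and maximize the beta-binomial likelihood numerically, parameterizing on the log scale to keep $\alpha,\beta>0$:

```python
import numpy as np
from scipy.special import betaln, gammaln
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate one group: k binomial observations with p_i ~ Beta(a, b).
a_true, b_true, k = 2.0, 5.0, 500
n = rng.integers(5, 20, size=k)        # n_i may vary within the group
p = rng.beta(a_true, b_true, size=k)   # latent success probabilities
x = rng.binomial(n, p)                 # observed successes

def neg_loglik(log_params):
    a, b = np.exp(log_params)          # optimize on the log scale
    ll = (gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)
          + betaln(x + a, n - x + b) - betaln(a, b))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(res.x)
```

With a few hundred observations the fitted mean $\hat\alpha/(\hat\alpha+\hat\beta)$ should land close to the true value $2/7$. Comparing groups could then proceed by, e.g., a likelihood-ratio test of common versus group-specific $(\alpha,\beta)$.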


So, what can we do in a more general case, where the distribution generating the binomial probabilities is some general $f(p; \theta)$? What we did above was integrate out the $p$ parameter with respect to its distribution $f$. That could be done explicitly because the beta distribution is the conjugate prior of the binomial, but the same principle applies even when the integration cannot be done in closed form. A general discussion is here: MLE: Marginal vs Full Likelihood

In this case the integral is one-dimensional over $(0,1)$, so for a parametric family it can be computed numerically. But I suspect that the intended case is nonparametric, so what to do then?
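To illustrate the numerical route for a non-conjugate parametric family (my own example; the logit-normal choice of $f$ and the function name `marginal_pmf` are assumptions, not part of the answer), the marginal pmf can be obtained by one-dimensional quadrature over $(0,1)$:

```python
import numpy as np
from scipy import integrate, stats

def marginal_pmf(x, n, mu, sigma):
    """Marginal pmf of x ~ Binomial(n, p) with p ~ logit-Normal(mu, sigma),
    a non-conjugate f, computed by 1-D numerical integration."""
    def integrand(p):
        # logit-normal density at p, times the binomial pmf
        logit = np.log(p / (1.0 - p))
        f_p = stats.norm.pdf(logit, mu, sigma) / (p * (1.0 - p))
        return stats.binom.pmf(x, n, p) * f_p
    val, _ = integrate.quad(integrand, 0.0, 1.0)
    return val

# Sanity check: the marginal pmf should sum to 1 over x = 0, ..., n.
n = 10
total = sum(marginal_pmf(x, n, mu=0.0, sigma=1.0) for x in range(n + 1))
```

Maximizing the sum of `log(marginal_pmf(x_i, n_i, mu, sigma))` over $(\mu,\sigma)$ would then give the marginal MLE for this family, in direct analogy with the closed-form beta-binomial case above.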

(I post this for now, and will try to come back for that last Q)

kjetil b halvorsen