If I understand you correctly, you want to sample $x_1,\dots,x_k$ values from a multinomial distribution with probabilities $p_1,\dots,p_k$ such that $\sum_i x_i = n$, but you want the distribution truncated so that $a_i \le x_i \le b_i$ for all $i$.
I see three solutions (none as elegant as in the non-truncated case):
- Accept-reject. Sample from the non-truncated multinomial, accept the sample if it fits the truncation boundaries, otherwise reject it and repeat the process. It is fast, but can be very inefficient (see the quick check after the code below).
rtrmnomReject <- function(R, n, p, a, b) {
  x <- t(rmultinom(R, n, p))  # R x k matrix of raw multinomial draws
  # keep only the rows that respect the truncation bounds
  x[apply(a <= x & x <= b, 1, all) & rowSums(x) == n, , drop = FALSE]
}
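To see how quickly this gets wasteful, here is a quick check; the parameters below are made up purely for illustration (they are not from the question) and are chosen so that the bounds are satisfied only rarely.
set.seed(123)
n <- 500
p <- c(1,5,2,4,3)/15
# only a small fraction of the raw draws survives these bounds
kept <- rtrmnomReject(1e4, n, p, a = 40, b = 150)
nrow(kept) / 1e4  # estimated acceptance rate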
- Direct simulation. Sample in a fashion that resembles the data-generating process, i.e. draw a single marble from a random urn and repeat the process until you have sampled $n$ marbles in total, but once you deplete all the marbles in a given urn (i.e. $x_i$ reaches $b_i$), stop drawing from that urn. I implemented this in the script below (a short sanity check follows it).
# single draw from a truncated multinomial with truncation points a, b
# (a and b are vectors of length k = length(p))
rtrmnomDirect <- function(n, p, a, b) {
  k <- length(p)
  repeat {
    pp <- p          # reset pp
    x <- numeric(k)  # reset x
    repeat {
      if (sum(x < b) == 1) {              # if only a single category is left
        x[x < b] <- x[x < b] + n-sum(x)   # fill this category with the remainder
        break
      }
      i <- sample.int(k, 1, prob = pp)    # sample x[i]
      x[i] <- x[i] + 1
      if (x[i] == b[i]) pp[i] <- 0        # if x[i] is filled, do not
                                          # sample from it any more
      if (sum(x) == n) break              # if we picked n marbles, stop
    }
    if (all(x >= a)) break                # if all x >= a, the sample is valid
    # otherwise reject it and start over
  }
  return(x)
}
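A quick sanity check of a single draw (the bounds are picked arbitrarily for illustration): it should sum to $n$ and stay within the truncation points.
set.seed(123)
x <- rtrmnomDirect(500, c(1,5,2,4,3)/15, a = rep(20, 5), b = rep(150, 5))
sum(x)                   # should be 500
all(x >= 20 & x <= 150)  # should be TRUE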
- Metropolis algorithm. Finally, the third and most efficient approach is to use the Metropolis algorithm. The algorithm is initialized by direct simulation (but can be initialized differently) to draw the first sample $X_1$. In the following steps, iteratively, a proposal value $y = q(X_{i-1})$ is accepted as $X_i$ with probability $\min\{1,\, f(y)/f(X_{i-1})\}$, otherwise the $X_{i-1}$ value is taken in its place, where $f(x) \propto \prod_i p_i^{x_i}/x_i!$. As a proposal I used a function $q$ that takes the $X_{i-1}$ value, randomly picks between 0 and `step` cases from one category and moves them to another category.
# draw R values
# 'step' parameter defines the magnitude of jumps
# for the Metropolis algorithm
# 'init' is a vector of values to start with
rtrmnomMetrop <- function(R, n, p, a, b,
                          step = 1,
                          init = rtrmnomDirect(n, p, a, b)) {
  k <- length(p)
  if (length(a) == 1) a <- rep(a, k)
  if (length(b) == 1) b <- rep(b, k)
  # unnormalized target log-density
  lp <- log(p)
  lf <- function(x) {
    if (any(x < a) || any(x > b) || sum(x) != n)
      return(-Inf)
    sum(lp*x - lfactorial(x))
  }
  step <- max(2, step+1)
  # proposal function: move u cases from one category to another
  q <- function(x) {
    idx <- sample.int(k, 2)
    u <- sample.int(step, 1) - 1
    x[idx] <- x[idx] + c(-u, u)
    x
  }
  tmp <- init
  x <- matrix(nrow = R, ncol = k)
  ar <- 0
  for (i in 1:R) {
    proposal <- q(tmp)
    prob <- exp(lf(proposal) - lf(tmp))
    if (runif(1) < prob) {  # accept the proposal
      tmp <- proposal
      ar <- ar + 1
    }
    x[i, ] <- tmp
  }
  structure(x, acceptance.rate = ar/R, step = step-1)
}
The algorithm starts at $X_1$ and then wanders around different regions of the distribution. It is obviously faster than the previous ones, but you need to remember that if you use it to sample a small number of cases, you could end up with draws that are close to each other (autocorrelated). Another problem is that you need to decide on the step size, i.e. how big the jumps the algorithm makes should be -- too small may lead to moving slowly, too big may lead to making too many invalid proposals and rejecting them (a quick acceptance-rate check follows the example below). You can see an example of its usage below. On the plots you can see: marginal densities in the first row, traceplots in the second row, and plots showing subsequent jumps for pairs of variables.
n <- 500
a <- 50
b <- 125
p <- c(1,5,2,4,3)/15
k <- length(p)
x <- rtrmnomMetrop(1e4, n, p, a, b, step = 15)
cmb <- combn(1:k, 2)
par.def <- par(mfrow=c(4,5), mar = c(2,2,2,2))
for (i in 1:k)
  hist(x[,i], main = paste0("X", i))
for (i in 1:k)
  plot(x[,i], main = paste0("X", i), type = "l", col = "lightblue")
for (i in 1:ncol(cmb))
  plot(jitter(x[,cmb[1,i]]), jitter(x[,cmb[2,i]]),
       type = "l", main = paste(paste0("X", cmb[,i]), collapse = ":"),
       col = "gray")
par(par.def)
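If you are unsure about the step size, you can compare acceptance rates for a few candidate values first. This is only an illustrative check (the candidate values are arbitrary), reusing n, p, a and b from the example above.
# acceptance rate for several step sizes
sapply(c(1, 5, 15, 30), function(s)
  attr(rtrmnomMetrop(2000, n, p, a, b, step = s), "acceptance.rate"))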

The problem with sampling from this distribution is that, in general, it describes a very inefficient sampling strategy. Imagine that $p_1 \ne \dots \ne p_k$ and $a_1 = \dots = a_k$, $b_1 = \dots = b_k$, with the $a_i$'s close to the $b_i$'s; in such a case you want to sample categories with different probabilities, yet you expect similar frequencies in the end. In the extreme case, imagine a two-category distribution where $p_1 \gg p_2$ and $a_1 \ll a_2$, $b_1 \ll b_2$; in such a case you are expecting a very rare event to happen (a real-life example of such a distribution would be a researcher who repeats sampling until he finds a sample that is consistent with his hypothesis, so it has more to do with cheating than with random sampling).
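To make the two-category example concrete (the numbers are invented purely for illustration): with $n = 100$, $p_2 = 0.1$ and a lower bound forcing $x_2 \ge 50$, the chance that an unrestricted multinomial draw satisfies the bound is astronomically small, which is why plain accept-reject breaks down completely here.
# P(X2 >= 50) when X2 ~ Binomial(100, 0.1)  (with k = 2, x2 is binomial)
pbinom(49, size = 100, prob = 0.1, lower.tail = FALSE)
# roughly 6e-24, i.e. on the order of 10^23 raw draws per accepted sample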
The distribution is much less problematic if you define it as Rukhin (2007, 2008) does, where you sample $np_i$ cases for each category, i.e. you sample proportionally to the $p_i$'s.
Rukhin, A. L. (2007). Normal order statistics and sums of geometric random variables in treatment allocation problems. Statistics & Probability Letters, 77(12), 1312-1321.
Rukhin, A. L. (2008). Stopping Rules in Balanced Allocation Problems: Exact and Asymptotic Distributions. Sequential Analysis, 27(3), 277-292.