Sum of sample given a priori knowledge of its maximum

Question

Given a sample of discrete random variables $X_1, X_2, \ldots, X_n \sim F$, I am looking to calculate the distribution given by the probability mass function:

$$P\left(\sum_{i=1}^n X_i = x~\middle|~\max_{i=1}^n X_i = a\right)$$

In other words, given that I know the maximum value of the sample, I would like to know the distribution of the sum.

My initial approach to this problem was to define a new distribution $F'$, a truncated version of $F$ supported up to and including $a$, such that we can independently take $Y_1, Y_2, \ldots, Y_{n-1} \sim F'$ and rewrite:

$$P\left(\left(a + \sum_{i = 1}^{n - 1} Y_i\right) = x\right)$$

I was initially expecting this to be an exact solution to the problem, but some inspection seems to indicate that it is not. For small values of $n$, there is a noticable difference in the distribution of my model and the "truth" (gathered by repeatedly drawing samples, and keeping those with the correct maximum).

Here is an example in which $n = 2$, $a = 10$, $F = B(20, 0.5)$, with the blue histogram produced from 10,000 samples:

Is there a way that I can model this distribution exactly? The code used to produce the plot in this question is available in this gist.

A general formula can be derived but without further constraints on the distribution, it will be a mixture requiring $O(nN+1)$ terms where $N$ is the number of values in the support of $F.$ In your example, $nN+1=(2)(21)+1=43.$ For almost all purposes, simulation would be a more effective method of approximating such a distribution. — whuber, Feb 21 '22 at 15:52

score 6 · Answer 1 · edited Feb 21 '22 at 16:12

The problem with your initial approach is that it does not properly deal with the possibility that more than one of the $X_i$ is $a$; it has the tendency to overcount these cases (which tend to have higher sums) and undercount cases where a single case is the maximum (which tend to have lower sums), which is why your blue bars are shifted right to your orange bars.

As a much simpler example, consider $n=2, a=1, F \sim Bin(1,\frac12)$. Given the maximum is $1$, the total is twice as likely to be $1$ as $2$ (cases $1+0, 0+1,1+1$ are equally likely) but your initial approach could suggest wrongly that the totals of $1$ or $2$ are equally likely (cases $1+0, 1+1$).

For your example, it is possible to calculate the possibilities, for example in R:

# Alter these for different distributions or maximum or number of random variables summed

n <- 2                              # number RVs to be summed   
a <- 10                             # maximum allowed
probtrunc <- dbinom(0:a, 20, 1/2)   # truncated pmf

# Calculations
probless <- c(probtrunc[-(a+1)],0)
probsum <- numeric((n+1)*a+1)
probsum[a+1] <- 1
probsumless <- probsum
for (i in 1:n){
  for (j in ((i+1)*a+1):(a+1)){
    probsum[j]     <- sum(probsum[j:(j-a)] * probtrunc)
    probsumless[j] <- sum(probsumless[j:(j-a+1)] * probless)  
    }
  if (i == n-1){
    probearly <- probsum 
    }
  }
probdiff <- probsum - probsumless
results <- cbind(total=a:(n*a), 
      actualprob=(probdiff/sum(probdiff))[(2*a+1):((n+1)*a+1)],
      yourcalc=(probearly/sum(probearly))[(a+1):(n*a+1)] )

giving the results

print(results)
#      total   actualprob     yourcalc
# [1,]    10 1.907349e-06 1.621623e-06
# [2,]    11 3.814697e-05 3.243247e-05
# [3,]    12 3.623962e-04 3.081084e-04
# [4,]    13 2.174377e-03 1.848651e-03
# [5,]    14 9.241104e-03 7.856765e-03
# [6,]    15 2.957153e-02 2.514165e-02
# [7,]    16 7.392883e-02 6.285412e-02
# [8,]    17 1.478577e-01 1.257082e-01
# [9,]    18 2.402687e-01 2.042759e-01
#[10,]    19 3.203583e-01 2.723679e-01
#[11,]    20 1.761971e-01 2.996046e-01

which seems to match your bar chart of your simulations

Jarle Tufto · Accepted Answer · 2022-02-21T14:22:29.597

This is a only partial answer for the case of $n=2$. Let $p(x)$ and $F(x)$ denote the pmf and cdf of $X_1$ and $X_2$. The joint pmf of the minimum and the maximum $X_{(1)},X_{(2)}$ is then clearly $$ p_{X_{(1)},X_{(2)}}(x_1,x_2)=\begin{cases} p(x_1)p(x_2) & \text{for }x_1=x_2 \\ 2p(x_1)p(x_2) & \text{for }x_1<x_2 \end{cases} $$ since $X_{(1)},X_{(2)}$ can take the same value only in one way whereas they can take different values in two ways.

Conditional on $X_{(2)}=x_2$ the pmf of $X_{(1)}$ is thus $$ p_{X_{(1)}|X_{(2)}=x_2}(x_1)=\frac{p_{X_{(1)},X_{(2)}}(x_1,x_2)}{p_{X_{(2)}}(x_2)}. $$ The cdf of $X_{(2)}$ is $$ F_{X_{(2)}}(x_2)=F(x_2)^2 $$ and hence the pmf of $X_{(2)}$ is $$ p_{X_{(2)}}(x_2)=F(x_2)^2-F(x_2-1)^2. $$ The pmf of the sum conditional on $X_{(2})=a$ is then $$ P(X_{(1)}+X_{(2)}=x|X_{(2)}=a)=p_{X_{(1)}|X_{(2)}=a}(x-a) $$ which will agree with your simulations.

Perfect, thank you very much! – StephenSwat Feb 21 '22 at 11:29 — StephenSwat, Feb 21 '22 at 11:29

Sum of sample given a priori knowledge of its maximum

2 Answers2