6

I came across a question as below

Assume a series includes integer from 1 to N. Every time one samples an integer without replacement from the series. The process continues if $X_{n}$ >= $X_{n-1}$, and $X_{n}$ will be saved into another series $X_{s}$. The process won't stop until $X_{n}$ < $X_{n-1}$.

Then it asks the expected length of $X_{s}$.

Below I tried to simulate the output in Python


import numpy as np

def gen():

    '''
    return length of X_s
    '''

    N = 10000

    raw_data = list(np.arange(N))  #1,2,3,...,N
    X_s = []

    last_value = - float('inf')
    for _ in range(N):
        cur_value = np.random.choice(raw_data)
        raw_data.pop(cur_value)
        if cur_value >= last_value:
            last_value = cur_value
            X_s.append(last_value)
        else:
            break
    return len(X_s)    

a = [gen() for _ in range(1000)]  # simulate 1000 times
np.mean(a)

The result is around 1.6~1.7. I am looking for a closed form solution for the expected length, any thoughts?

Avraham
  • 3,182
  • 21
  • 40
Bratt Swan
  • 653
  • 2
  • 7
  • 12
  • 2
    See the analyses in answers at https://stats.stackexchange.com/questions/550847, which concerns a more complicated version of this question. BTW, (1) the process must stop after $n$ observations no matter what and (2) since you have to observe at least two values before stopping, the expectation *must* be $2$ or greater whenever $n\gt 1.$ – whuber Dec 27 '21 at 16:14
  • 1
    Also closely related to [this question](https://stats.stackexchange.com/questions/350923/). – Ben Feb 08 '22 at 10:35

1 Answers1

11

Let $K$ be the random variable given by the length, so that $1\le K \le n.$ Its survival function is

$$S(k) = \Pr(K \gt k).$$

The event $K\gt k$ can be characterized as $X_1 \lt X_2 \lt \cdots \lt X_k.$ Since all $k!$ possible orderings are equally likely with random sampling, this event has a probability $1/k!.$ Thus

$$S(k) = \frac{1}{k!}, \ k = 1, 2, \ldots, n-1.$$

Trivially, $S(0) = 1$ and $S(k) = 0$ for integral $k \ge n$ (because the sequence $(X_i)$ must stop after $n$ observations: there's nothing left to sample). This simple formula describes the entire distribution of $K.$

According to the general formula for the expectation of a non-negative integral variable $E[K] = \sum_{k=0}^\infty S(k),$ the answer is

$$E[K] = 1 + 1 + 1/2 + 1/3! + \cdots + 1/(n-1)!.$$

For large $n$ this is extremely close to, but less than, $e = \exp(1) \approx 1 + 1.71828\ldots.$ This latter value (one less than $E[K]$) is likely the number your simulation was estimating.

Here is an R simulation that tracks $K$ for many samples and (when $n \gt 2,$ because for $n \le 2$ the length is always $n$) performs a chi-squared test to compare the observed distribution to this calculation:

n <- 3
s <- tabulate(replicate(1e4, {
  x <- sample.int(n)       # A sample
  d <- diff(x)             # The successive changes
  min(n, which(d < 0) + 1) # The length, including the first drop (if any)
}), n)
if (n > 2) {
  p <- c(-diff(1 / factorial(1:(n-1))), 1 / factorial(n-1)) # Computed distribution
  chisq.test(s[-1], p=p)                                    # (`s[-1]` is always zero)
}

Upon running this I found $5078$ instances where $K=2$ and $4922$ where $K=3.$ The chi-squared statistic has a p-value of $0.12:$ no significant evidence that the formula is wrong. Runs with larger values of $n$ continue to confirm the correctness of the answer.

whuber
  • 281,159
  • 54
  • 637
  • 1,101