8

Suppose we have an alphabet containing $m+1$ symbols, $\{a, b, c, d, e,..., \$\}$, where $p = \Pr(a) = \Pr(b) =\cdots$, and $\Pr(\$) = 1 - (\Pr(a)+\Pr(b)+\cdots)=1-mp$.

For a random string of length $n$, what is the probability that the letters ${a, b, c, ...}$ (not including $\$$), occur in order (not necessarily consecutively)? In other words, the string is of length $n$ and satisfies the regular expression $*a*b*c*\cdots$.

Some clarifications:

I just need the letters to appear in order sometime. So $acbc$ is ok because it contains $abc$ in that order.

I do need all $m$ letters to appear in order.

Letters can be repeated.

Ben
  • 91,027
  • 3
  • 150
  • 376
Andrew W
  • 183
  • 5

2 Answers2

11

That regular expression represents a Markov chain on $m+1$ states corresponding to a start state $s$ and each of the letters. A transition is made from $s$ to $a$, from $a$ to $b$, ..., and from the penultimate letter to the last, always with probability $p$. Otherwise the state remains the same. The final state is an absorbing state: when it has been reached, all letters have been observed in sequence.

In terms of the states $(s, a, b, \ldots)$, the transition matrix is

$$\mathbb{P}_m = \pmatrix{1-p & p & 0 &\cdots & 0\\ 0 & 1-p & p &\cdots & 0 \\ \vdots & 0 & \ddots & p & \vdots \\ 0 & \cdots & 0 &1-p & p\\ 0 & 0 & \cdots & 0 & 1 }$$

Standard linear algebraic techniques (the Jordan normal form of $\mathbb{P}_m$ and its change of basis matrix are simple and sparse, making this fairly easy to do) establish that for $n\ge m$ the last entry in the first row of the matrix power $\mathbb{P}_m^n$ is

$$\mathbb{P}_m^n(1,m+1) = p^m \sum_{i=0}^{n-m} \binom{m-1+i}{m-1}(1-p)^i.$$

This is the chance of reaching the absorbing state from the start state after $n$ transitions: it answers the question. If you like, it can be expressed in "closed form" in terms of a Hypergeometric function as

$$\mathbb{P}_m^n(1,m+1) =1-p^m \binom{n}{m-1} (1-p)^{-m+n+1} \, _2F_1(1,n+1;n+2-m;1-p).$$

The sum has a pleasant combinatorial interpretation. Let $m+i$ be the position at which the last letter first occurs. It is preceded by a (possibly empty) sequence of non-$a$s, each with a $1-p$ chance of occurring; then an $a$ with a $p$ chance of occurring; then a (possibly non-empty) sequence of non-$b$s, etc. There are $\binom{m-1+i}{m-1}$ locations at which to place the first appearance of an $a$, then the first appearance of a $b$ after that, etc. Thus--including the first appearance of the last letter in position $m+i$--the probability is $\binom{m-1+i}{m-1}p^m(1-p)^k$. This gives one term of the sum. Thus, the sum breaks up the sequences according to where the last letter first occurs, which can be anywhere from position $m+0$ through $m+(n-m)$--these are obviously disjoint--and adds up their probabilities.

As a simple example to clarify the interpretation, suppose $m=2$ and consider $n=3$. There are four sequences of three symbols, each of probability $p^3$, and three other sequences of probability $p^2(1-2p)$, in which the symbols $a$ and $b$ appear in order:

$$aab, aba, abb, bab; ab\$,a\$b, $ab.$$

The chance therefore is

$$4p^3 + 3p^2(1-2p) = 3p^2 - 2p^3 = p^2(3-2p) = p^2(1 + 2(1-p)) = \mathbb{P}_2^3(1,3).$$

The combinatorial interpretation is that the regular expression ^ab (with $b$ in position $2$) occurs with probability $p^2$; and ^[^a]*a[^b]*b, with $b$ in position $3$, occurs in two ways as ^a[^b]b and ^[^a]ab, each with probability $p^2(1-p)$.

whuber
  • 281,159
  • 54
  • 637
  • 1,101
0

By "Letters can be repeated" you mean that abbc is a valid string? They 'appear in order'?

If not, $1 - (1-p^m)^{n-m+1}$ seems to be the answer for me. $1-p^m$ is the probability that in a given space of $m$ characters there is no such combination, then you extend it to all $n-m+1$ possible spaces of $m$ characters

If yes then you have a lower bound

gsmafra
  • 257
  • 2
  • 10
  • This formula does not agree with complete enumeration of cases when $m$ and $n$ are small, so it cannot be generally correct. – whuber Jun 04 '15 at 22:17