
Let's assume that we have a pattern $\alpha$ of bits of length $n$. I wish to know the probability of $\alpha$ appearing in a random string of bits of length $m$ at least twice (where $m > n$), assuming that bit 1 and bit 0 are equiprobable (i.e., each bit is an independent fair coin flip).

So for example, let's assume $n=4$, the pattern $\alpha$ is 0111, and $m=10$. A valid instance would then be the string 0111010111, where we have an $\alpha$ at the beginning and one at the end. Of course there are many more, but how can I calculate this probability as a function of $n$ and $m$?

I reckon this also depends on what $\alpha$ looks like: if, for example, $\alpha = 00010000$, then nothing prevents another occurrence of $\alpha$ from starting in the last 3 bits of the "old" $\alpha$, since the sequence has coinciding bits at its beginning and end.

EDIT (to incorporate a comment into the question, as requested)

An upper bound with an analysis of its error is also a valid answer. Intuition from simulations in R, Python, or Matlab used to arrive at that bound is also valid (see the rough sketch below). I am not interested in degenerate cases where $n$ is very small, say $\alpha = 01$; instead assume that $n$ is in the ballpark of 10-20 and $m$ is a few hundred (300-500).

EDIT2: changed "twice" to "at least twice" above, which is what I meant from the beginning.
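
To illustrate what I mean by simulation intuition, here is a rough Monte Carlo sketch in R (the function name sim.prob and its brute-force approach are just for illustration):

sim.prob <- function(alpha, m, reps = 1e4) {
  n <- nchar(alpha)
  mean(replicate(reps, {
    # Draw m fair bits and paste them into one string
    bits <- paste(sample(0:1, m, replace = TRUE), collapse = "")
    # Count occurrences of alpha at every starting position, allowing overlaps
    hits <- sum(sapply(1:(m - n + 1), function(i) substr(bits, i, i + n - 1) == alpha))
    hits >= 2
  }))
}
sim.prob("0111", 10)  # roughly 0.023; the exact value derived below is 0.0234375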

  • Draw the Markov chain (it has $2n+1$ states), then evaluate the probability of being in the goal state of the Markov chain after $m$ transitions. – Neil G Mar 17 '16 at 09:06
  • Can the occurrences of the patterns overlap? Define a finite state machine which counts the number of occurrences of the pattern. – kjetil b halvorsen Mar 17 '16 at 11:45
  • @kjetilbhalvorsen yes, that is actually what I tried to explain in my last paragraph. The two occurrences can overlap, depending on what the pattern $\alpha$ looks like. – titus.andronicus Mar 17 '16 at 13:31
  • I think in general this is very difficult since the probability will depend on the actual pattern generated. Can you estimate this using a simulation or do you need an analytical solution as a function of $n$ and $m$? – dsaxton Mar 21 '16 at 13:40
  • Do you want the probability of *exactly* two appearances or *at least* two appearances? Note that this question is closely related to [Penney's Game](http://stats.stackexchange.com/questions/12174). – whuber Mar 24 '16 at 16:49
  • Further to whuber's link to a previous thread on SE discussing similar issues, Wikipedia has an article devoted to [Penney's game](https://en.wikipedia.org/wiki/Penney%27s_game). (The main extra point that's brought out in the wiki article is the non-transitivity of the game, which is not the key issue either here or in the linked thread, but is nonetheless a very interesting result.) A piece that discusses Conway's algorithm for calculating the winning odds is [here](https://web.archive.org/web/20150423224341/https://plus.maths.org/content/os/issue55/features/nishiyama/index). – Silverfish Mar 26 '16 at 19:47

3 Answers


As @NeilG stated in the comments, the desired probability can be computed exactly by defining a $(2n+1)$-state Markov chain and computing the probability of having seen 2 copies of $\alpha$ after $m$ iterations. The states in the Markov chain are of the form $(a, b)$, where $a$ is the number of times we have already seen $\alpha$ (0, 1, or 2) and $b$ is our progress toward seeing $\alpha$ at the current position. For the small example provided with pattern 0111, the states are:

\begin{align*} (0,&~-) \\ (0,&~0) \\ (0,&~01) \\ (0,&~011) \\ (1,&~-) \\ (1,&~0) \\ (1,&~01) \\ (1,&~011) \\ (2,&~-) \\ \end{align*}

The transition probabilities are defined quite naturally. For pattern 0111, if we see a 0 from state $(0,~-)$ then we transition to $(0,~0)$ and otherwise we stay in $(0,~-)$. If we see a 1 from state $(0,~0)$ then we transition to $(0,~01)$ and otherwise we stay in $(0,~0)$, as we still have the first 0 in the pattern despite not getting a 1. If we see a 1 from state $(0,~01)$ then we transition to $(0,~011)$ and otherwise we go back to state $(0,~0)$. Finally, if we see a 1 from state $(0,~011)$ then we transition to $(1,~-)$, and otherwise we go back to $(0,~0)$. State $(2,~-)$ is an absorbing state.

This cleanly handles overlapping patterns. If we were searching for pattern 00010000, then upon getting a 0 from $(0,~0001000)$ we would transition to $(1,~000)$.
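
In general, the state we fall back to after reading each bit is simply the longest suffix of the bits seen so far that is also a proper prefix of the pattern. Here is a tiny standalone sketch of that rule (an illustrative helper called fallback, separate from the actual implementation below):

fallback <- function(seen, pattern) {
  # Longest suffix of `seen` that is also a proper prefix of `pattern`
  for (k in min(nchar(seen), nchar(pattern) - 1):0) {
    suffix <- substr(seen, nchar(seen) - k + 1, nchar(seen))
    if (suffix == substr(pattern, 1, k)) return(suffix)
  }
}
fallback("0110", "0111")  # "0": from (0, 011), seeing a 0 sends us back to (0, 0)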

Computing the transition probabilities and iterating the Markov chain from the initial state $(0,~-)$ can be implemented without too much trouble in your favorite programming language (I'll use R here):

library(expm)  # for the %^% matrix power operator

# Longest suffix of `pattern` that appears in `partial` (the stored prefixes of
# alpha); suffixes are generated longest-first, so the first match is the longest.
best.match <- function(pattern, partial) {
  to.match <- sapply(seq_len(nchar(pattern)+1), function(s) substr(pattern, s, nchar(pattern)))
  matches <- match(to.match, partial)
  matches <- matches[!is.na(matches)]
  partial[matches[1]]
}

get.prob <- function(alpha, m) {
  n <- nchar(alpha)
  # Proper prefixes of alpha, of lengths 0 through n-1
  partial <- sapply(0:(n-1), function(k) substr(alpha, 1, k))
  # States (a, b): a = copies of alpha seen so far, b = current partial match;
  # (2, "") is the absorbing goal state
  state.match <- rep(0:2, c(n, n, 1))
  state.pattern <- c(partial, partial, "")
  # Column i of mat holds the transition probabilities out of state i
  mat <- sapply(seq_along(state.match), function(i) {
    this.match <- state.match[i]
    this.pattern <- state.pattern[i]
    if (this.match == 2) {
      as.numeric(state.match == 2)  # Absorbing state
    } else {
      # Average over the two equally likely next bits
      rowMeans(sapply(paste0(this.pattern, c("0", "1")), function(new.pattern) {
        if (new.pattern == alpha && this.match == 0) {
          # First copy completed; fall back to the longest reusable overlap
          as.numeric(state.match == 1 &
                     state.pattern == best.match(substr(new.pattern, 2, nchar(new.pattern)),
                                                 partial))
        } else if (new.pattern == alpha && this.match == 1) {
          as.numeric(state.match == 2)  # Second copy completed
        } else {
          # No copy completed; fall back to the longest matching prefix
          as.numeric(state.match == this.match &
                     state.pattern == best.match(new.pattern, partial))
        }
      }))
    }
  })
  # Distribution after m steps, starting from state 1 = (0, -); last entry is (2, -)
  tail((mat %^% m)[,1], 1)
}

You can invoke the function by passing the pattern (as a string) and the number of iterations $m$:

get.prob("0111", 10)
# [1] 0.0234375
get.prob("00010000", 200)
# [1] 0.177094

Though this is not a closed-form solution, it does give the exact desired probabilities, so it can be used to evaluate the quality of any other bounds that are derived.

josliber
  • +1, especially for working out the hardest part of the solution explicitly, which is computation of the transition matrix. I suspect a reasonably succinct formula could be found in terms of the data structure typically used in the Boyer-Moore search algorithm. BTW, for larger problems where $m$ is big, diagonalize `probs` first and compute its power directly, rather than iteratively finding the power: not only will it be much faster, it will also have better numerical accuracy. A bonus is that it can suggest accurate asymptotic approximations. – whuber Mar 24 '16 at 20:25
  • Nice explanation! Also, all very insightful points, @whuber. The final point is reminiscent of our two different answers to a similar problem somewhere on this site :) – Neil G Mar 26 '16 at 20:00
  • This one: http://stats.stackexchange.com/questions/21825/probability-over-multiple-blocks-of-events/60814 – Neil G Mar 26 '16 at 22:44

Part 1: The occurrences of $\alpha$ are disjoint

First consider the case where the two occurrences of $\alpha$ are non-overlapping. Then we could lay out the $m$ bits as:

******alpha*****alpha****

Basically there are three regions of bits that can take any value: before the first $\alpha$, between the two occurrences, and after the second occurrence. Let's call the number of bits in these regions $n_1$, $n_2$, and $n_3$. We know $n_1+n_2+n_3 = m-2n$, so for a fixed set of values $(n_1, n_2, n_3)$ there are $2^{m-2n}$ ways to select these bits. The number of non-negative triples $(n_1, n_2, n_3)$ summing to $m-2n$ is, by stars and bars, ${m-2n+2 \choose 2} = \frac{(m-2n+2)(m-2n+1)}{2}$. Thus we approximate the number of configurations with non-overlapping occurrences of $\alpha$ as $2^{m-2n-1}(m-2n+2)(m-2n+1)$, and the proportion of all $2^m$ configurations as $2^{-2n-1}(m-2n+2)(m-2n+1)$. Note that this is an upper bound on the number of non-overlapping configurations, because if $\alpha$ also appears in the ****** bits then we count the same configuration multiple times; we expect the bound to be accurate when $\alpha$ is unlikely to occur frequently within the $m$ bits.
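
Plugging in the question's example ($n=4$, $m=10$), this bound evaluates to

$$2^{-9}(10-8+2)(10-8+1) = \frac{4\cdot 3}{512} = 0.0234375.$$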

For pattern 0111 with $m=10$ this is in fact exactly correct, as verified using the get.prob function from my other answer:

# Part 1 bound: configurations with two disjoint occurrences of the pattern
approx.part1 <- function(pattern, m) {
  n <- nchar(pattern)
  2^(-2*n-1)*(m-2*n+2)*(m-2*n+1)
}
approx.part1("0111", 10)
# [1] 0.0234375
get.prob("0111", 10)
# [1] 0.0234375

Part 2: Overlapping occurrences of $\alpha$

We further need to account for the fact that the two occurrences of $\alpha$ may overlap. To do so, let $S$ be the set of all non-redundant suffixes that could be appended to $\alpha$ to obtain a second copy of $\alpha$ at the end. As an example, consider the pattern 0101110101. The suffix 110101 could be appended to get another copy of $\alpha$, as could 01110101. For pattern 000000, the only non-redundant suffix that could be appended to get a second copy of $\alpha$ is 0.
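
A quick standalone sketch of how these candidate suffixes arise (for illustration only; the s.sizes function below does this properly, including the removal of redundant suffixes):

alpha <- "0101110101"
n <- nchar(alpha)
suff <- sapply(2:n, function(i) substr(alpha, i, n))                    # proper suffixes
is.border <- sapply(suff, function(s) s == substr(alpha, 1, nchar(s)))  # suffix == prefix?
sapply(nchar(suff[is.border]), function(k) substr(alpha, k + 1, n))     # strings to append
# [1] "110101"   "01110101"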

Consider a particular suffix $s\in S$ of length $|s|$. We could append it to the end of the first occurrence of $\alpha$ to get a second occurrence; the remaining $m-n-|s|$ bits can be set in any way desired, and there are $m-n-|s|$ positions at which to place this combined object among the $m$ total bits. As a result, the total number of configurations with this suffix is $2^{m-n-|s|}(m-n-|s|)$, which is proportion $2^{-n-|s|}(m-n-|s|)$ of the $2^m$ configurations. Summing over all suffixes, we get the following upper bound on the probability of at least two occurrences of $\alpha$:

$$2^{-2n-1}(m-2n+2)(m-2n+1) + \sum_{s\in S} 2^{-n-|s|}(m-n-|s|)$$
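
For example, for $\alpha = 0101110101$ the non-redundant suffixes found above have lengths $6$ and $8$, so with $m=50$ the bound evaluates to

$$2^{-21}\cdot 32\cdot 31 + 2^{-16}\cdot 34 + 2^{-18}\cdot 32 \approx 0.000473 + 0.000519 + 0.000122 \approx 0.001114,$$

which matches the full.approx output shown further below.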

This is quite simple to code up in R:

# Lengths of the non-redundant suffixes that complete a second, overlapping copy
s.sizes <- function(pattern) {
  n <- nchar(pattern)
  # All proper suffixes of the pattern
  suffixes <- sapply(2:n, function(i) substr(pattern, i, n))
  # Keep suffixes that are also prefixes of the pattern ("borders")
  ok.suffixes <- suffixes[suffixes == sapply(nchar(suffixes), function(i) substr(pattern, 1, i))]
  # Strings that must be appended to complete the second copy
  needed <- sapply(nchar(ok.suffixes), function(i) substr(pattern, i+1, n))
  if (length(needed) == 0) {
    return(NULL)
  }
  # Drop redundant suffixes: those that begin with a shorter valid suffix
  block <- rowSums(sapply(seq_along(needed), function(i) {
    nchar(needed) > nchar(needed[i]) & substr(needed, 1, nchar(needed[i])) == needed[i]
  })) > 0
  nchar(needed)[!block]
}
# Part 1 (disjoint) bound plus the Part 2 (overlap) correction
full.approx <- function(pattern, m) {
  s <- s.sizes(pattern)
  n <- nchar(pattern)
  if (length(s) > 0) {
    2^(-2*n-1)*(m-2*n+2)*(m-2*n+1) + sum(sapply(s, function(i) 2^(-n-i)*(m-n-i)))
  } else {
    2^(-2*n-1)*(m-2*n+2)*(m-2*n+1)
  }
}

With a few examples we can see that this is a reasonable upper bound in cases where the probability of the outcome is low (recall that get.prob returns the exact solution):

full.approx("0111", 10)
# [1] 0.0234375
get.prob("0111", 10)
# [1] 0.0234375

full.approx("0101110101", 50)
# [1] 0.001113892
get.prob("0101110101", 50)
# [1] 0.00108631

However, the upper bound can be rather loose in cases where the probability of seeing two occurrences of $\alpha$ is high:

full.approx("00010000", 200)
# [1] 0.3023529
get.prob("00010000", 200)
# [1] 0.177094
josliber
  • Quick question. Why did you write 2 separate answers? Both are very well explained, and I see they follow different approaches, but shouldn't they be combined? Forgive my lack of knowledge, first time with a bounty question ;) – titus.andronicus Mar 26 '16 at 14:55
  • @titus.andronicus The other answer computes the exact solution using a Markov chain but makes no attempt to yield a closed-form solution. This one is a closed-form solution but is an upper bound. Since they are both pretty long and follow completely different techniques, I chose to post two separate answers. If you're interested in reading more, check out http://meta.stackexchange.com/questions/224772/can-i-post-more-than-one-answer-to-a-question and its duplicates. – josliber Mar 26 '16 at 14:57

If we assume that the two substrings are disjoint

(i.e. $s_1=s[i,i+n]$, $s_2=s[j,j+n]$ such that $0<i<i+n<j<m$),

then $\alpha$ doesn't matter (because $0$ and $1$ have an equal chance of being chosen).

So let's assume $\alpha$ is all zeros and rephrase the question as:

Given a random binary string of length $m$, what is the probability that it contains two runs of $n$ consecutive 0s?

The number of possible binary strings with $m$ bits is $2^m$.

The number of strings containing two such runs of $n$ zeros is approximately $2^{m-2n}\times\binom{m-2n+2}{2}$,

obtained by constructing a string of length $m-2n$ and then choosing two positions at which to insert the runs of $n$ zeros.

(This is actually only an upper bound; for the exact number we would need to apply inclusion-exclusion.)

So the probability is $$p\approx\frac{\binom{m-2n+2}{2}2^{m-2n}}{2^m}=\frac{(m-2n+2)(m-2n+1)}{2^{2n+1}}$$
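
For instance, with the question's example values ($n=4$, $m=10$) this evaluates to $\frac{4\cdot 3}{2^{9}} = 0.0234375$.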

Uri Goren
  • On the contrary, the exact $\alpha$ *does* matter, as you will find if you experiment with some possibilities. For instance, `00` has a $1/8$ chance of appearing twice in a three-bit string (it appears twice in `000`), but `10` has *no* chance of appearing twice. – whuber Mar 24 '16 at 18:15
  • @whuber, you are right, I edited my answer – Uri Goren Mar 24 '16 at 18:42
  • Unfortunately, because you didn't change the answer it's still incorrect! Moreover, your assumption is too strong--it doesn't really capture the essence of the issue. If you would like to understand this better, study the [Boyer-Moore algorithm.](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm) – whuber Mar 24 '16 at 18:58