
Let's say we flip a coin $N=100$ times and get "HTHTHTHHHTHHHHTTHTT...HHT". What is the probability that the letter combination "HHHHH" occurs at least once in this string of $100$ letters? If there is no precise formula, what is the best approximation as $N \to \infty$? I find it to be approximately:

$$1-\text{BinomPdf}(N/2,\,(1/2)^5,\,0) \, .$$
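This candidate approximation can be evaluated directly in R with dbinom, where dbinom(0, N/2, p) plays the role of $\text{BinomPdf}(N/2,\,p,\,0)$ (a quick sketch for a fair coin):

#Evaluate the proposed approximation for N = 100 and a fair coin
N <- 100
p <- (1/2)^5                          #probability that a given block of 5 tosses is all heads
1 - dbinom(0, size = N/2, prob = p)   #1 - BinomPdf(N/2, (1/2)^5, 0), roughly 0.80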

user217790
    A generalization of this question is answered at https://stats.stackexchange.com/questions/26988/probability-of-finding-a-particular-sequence-of-base-pairs/27031#27031. A variant is answered (using techniques applicable here) at https://stats.stackexchange.com/questions/12174/time-taken-to-hit-a-pattern-of-heads-and-tails-in-a-series-of-coin-tosses. – whuber Aug 16 '18 at 11:50

2 Answers


There are some useful approximations given in related questions on this site (see, for example, the links in the comment under the question). However, in the case of a string of heads it is possible to obtain an exact answer quite easily using Markov chains.


Exact probability via Markov chain: Consider a sequence of coin tosses $\{ x_n \mid n \in \mathbb{N} \}$. For a given value of $n$, let $\mathscr{W}$ be the event that five consecutive heads have occurred somewhere in the first $n$ tosses, and let $\mathscr{H}_k$ be the event that the last $k$ coin tosses were heads (and the one before that was not a head). We use these events to define the following partition into six states of interest:

$$\begin{matrix} \text{State 0} & & & \bar{\mathscr{W}} \cap \mathscr{H}_0, \\[6pt] \text{State 1} & & & \bar{\mathscr{W}} \cap \mathscr{H}_1, \\[6pt] \text{State 2} & & & \bar{\mathscr{W}} \cap \mathscr{H}_2, \\[6pt] \text{State 3} & & & \bar{\mathscr{W}} \cap \mathscr{H}_3, \\[6pt] \text{State 4} & & & \bar{\mathscr{W}} \cap \mathscr{H}_4, \\[6pt] \text{State 5} & & & \mathscr{W}. \\[6pt] \end{matrix}$$

Now, assume that the coin tosses are independent and identically distributed, and let $\theta$ be the probability of a head on a single toss. Your process can then be represented as a discrete-time Markov chain that begins in $\text{State 0}$ at $n=0$ and transitions according to the probability matrix:

$$\mathbf{P} = \begin{bmatrix} 1-\theta & \theta & 0 & 0 & 0 & 0 \\[6pt] 1-\theta & 0 & \theta & 0 & 0 & 0 \\[6pt] 1-\theta & 0 & 0 & \theta & 0 & 0 \\[6pt] 1-\theta & 0 & 0 & 0 & \theta & 0 \\[6pt] 1-\theta & 0 & 0 & 0 & 0 & \theta \\[6pt] 0 & 0 & 0 & 0 & 0 & 1 \\[6pt] \end{bmatrix}.$$

The last state is an absorbing state which represents the case where the chain has produced a run of five consecutive heads. For a given value of $n$ the probability of a run of five consecutive heads in the chain is $\mathbb{P}(\mathscr{W} | n) = \{ \mathbf{P}^n \}_{0,5}$. (This probability is zero for all $n <5$ as expected.)


Programming this in R: You can program this as a function in R using the code below. The function builds up an array containing the $n$-step transition matrices for $n = 1, \ldots, N$, so you obtain the probability for every intermediate number of tosses in a single call. It also lets you specify the probability of a head on a single toss, so it is not restricted to the case of a fair coin.

#Create function to give n-step transition matrix for n = 1...N
#t is the probability of a head
#N is the last value of n
PROB <- function(N, t) { P <- matrix(c(1-t, t, 0, 0, 0, 0, 
                                       1-t, 0, t, 0, 0, 0, 
                                       1-t, 0, 0, t, 0, 0,
                                       1-t, 0, 0, 0, t, 0,
                                       1-t, 0, 0, 0, 0, t,
                                         0, 0, 0, 0, 0, 1),
                                     nrow = 6, ncol = 6, 
                                     byrow = TRUE);
                         PPP <- array(0, dim = c(6,6,N));
                         PPP[,,1] <- P;
                         for (n in 2:N) { PPP[,,n] <- PPP[,,n-1] %*% P; } 
                         PPP }

#Calculate probability for N = 100 for fair coin
N <- 100;
t <- 0.5;
PROB(N,t)[1,6,N];

[1] 0.8101096

As you can see from this calculation, the probability of getting a string of five consecutive heads in $n=100$ tosses of a fair coin is $0.8101096$ (i.e., it will happen a little over four-fifths of the time). This calculation is extremely rapid using the above code for the Markov chain.
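As a cross-check, the same value can be obtained from the $5 \times 5$ block $\mathbf{Q}$ of transient states alone: starting from $\text{State 0}$, the probability of avoiding absorption within $n$ steps is the row sum $\sum_j \{\mathbf{Q}^n\}_{0,j}$, so $\mathbb{P}(\mathscr{W} | n) = 1 - \sum_j \{\mathbf{Q}^n\}_{0,j}$. A minimal sketch of this (the objects Q and Qn are just illustrative names):

#Cross-check using only the 5x5 block of transient states
t <- 0.5;
Q <- matrix(c(1-t, t, 0, 0, 0,
              1-t, 0, t, 0, 0,
              1-t, 0, 0, t, 0,
              1-t, 0, 0, 0, t,
              1-t, 0, 0, 0, 0),
            nrow = 5, ncol = 5, byrow = TRUE);
Qn <- diag(5);
for (n in 1:100) { Qn <- Qn %*% Q; }   #Q^100 by repeated multiplication
1 - sum(Qn[1, ]);                      #should reproduce 0.8101096 above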


Showing the probability as a function of $n$: It is quite simple and fast to calculate this probability for any given $n$, so we can easily generate a series of probabilities and show how the probability of the event of interest grows as $n$ increases. The following R code and resulting plot show the probability increasing as we increase the number of coin tosses. We also plot a red dashed line showing the approximation function suggested by Martijn Weterings (Sextus Empiricus) in the comments below. This shows that the approximation is reasonably close to the true probability curve.

library(ggplot2);

#Create data frame of probabilities up to N
N <- 300;
PPP <- PROB(N,t)[1,6,];

#Create approximation function (same length as PPP)
APPROX <- rep(0,N);
for (n in 1:N) APPROX[n] <- 1 - (1-t^5/(1+t+t^2+t^3+t^4))^n;

DF <- data.frame(n = 1:N, Probability = PPP, Approx = APPROX);

#Plot probability as a function of n
FIGURE <- ggplot(data = DF, aes(x = n, y = Probability)) +
          geom_line(size = 1) +
          geom_line(aes(y = Approx, colour = 'red'), size = 1, linetype = 'dashed') +
          theme(legend.position="none") +
          ggtitle("Probability of Five Consecutive Heads (Fair Coin)") +
          xlab("Number of Tosses") + ylab("Probability");
FIGURE;

[Figure: Probability of five consecutive heads (fair coin) versus number of tosses, with the approximation shown as a red dashed line.]

Ben
    This function approaches a balance between states such that the fraction in state 4 compared to states 0 to 4 is $$\frac{\theta^4}{1+\theta + \theta^2 + \theta^3 + \theta^4}$$ and the relative loss to state 5 is $$\tau(\theta) = \theta \frac{\theta^4}{1+\theta + \theta^2 + \theta^3 + \theta^4}$$ where $\tau(0.5) = 0.01613$ and the (approximated) left over after 100 tosses is $$1-(1-\tau(0.5))^{100} = 0.8032954$$ which is a close guess – Sextus Empiricus Aug 20 '18 at 15:39
  • @MartijnWeterings: Thanks for that - I have added your approximation into my code and plot above so that the reader can compare the exact answer with the approximation. – Ben Aug 20 '18 at 23:34
    Thank you for the answer! ...How did you get to the approximation? – user217790 Aug 26 '18 at 10:39

Just for fun:

substr_prob <- function(flips, string, iter, seed) {
  set.seed(seed)  #set the seed so the simulation is reproducible
  out <- sapply(1:iter, function(zzz) {
    #simulate one sequence of flips and check whether the target string appears
    result <- paste(sample(c("H", "T"), flips, replace = TRUE), collapse = "")
    grepl(string, result, fixed = TRUE)
  })
  return(mean(out))
}

> substr_prob(flips = 100, string = "HHHHH", iter = 10000, seed = 1839)
[1] 0.8084

This is technically slow because it builds and searches character strings, but it still takes under a second to run and is user friendly.
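If you want to avoid the string handling, a run-length version of the same simulation works too (a sketch; the name run_prob and its arguments are just illustrative):

#String-free variant: check directly whether any run of heads reaches the required length
run_prob <- function(flips, run_len, iter, seed) {
  set.seed(seed)
  hits <- replicate(iter, {
    x <- sample(c(0L, 1L), flips, replace = TRUE)   #1 = heads
    r <- rle(x)
    any(r$values == 1L & r$lengths >= run_len)      #any run of heads at least run_len long?
  })
  mean(hits)
}

run_prob(flips = 100, run_len = 5, iter = 10000, seed = 1839)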

Mark White