6

Let's say I have a randomly generated sequence consisting of letters A, C, T and G that's 1000 letters long. The probability of each letter occurring is 25%. What is the probability that the sequence 'AAAAA' will occur N times within the 1000-letter sequence?

The problem I have solving this is that the trials are dependent, otherwise this would be modeled nicely using the binomial/Poisson distributions. But if the sequence 'AAAAA' happens to occur at position X, then the probability of it occurring at position X + 1 is 0.25 and not 0.25 ^ 5.

Thank you.

1 Answers1

1

Well, there is a way to get an asymptotic probability that gets closer as the size of the sequence gets bigger. For a 1000-lenght sequence I think it can give you a good approximation.

Call $L_{i}$ the letter in the position $i$.

Consider the Markov Chain $X_{i} = \max(n|L_{i-j}, 0\leq j < n). $

The transition probabilities of that chain are simple:

  • $p(n,0)= 0,75$ for every n between 0 and 5.
  • $p(n,n+1)= 0,25$ for every n between 0 and 4.
  • $p(5,5) = 0,25$

Then you get the equilibrium distribution of the states $\pi_{n}$ by solving $\pi_{n} = \sum_{m} \pi_{m}.p(m,n)$

When you get $\pi_{5}$ you can get approximation by considering that you have a binomial distribution with 1000 tries and probability $\pi_{5}$. It gets better with bigger sequences.

Jundiaius
  • 775
  • 4
  • 12
  • Thank you. Are you sure the result is a binomial distribution? I simulated this on a million random sequences, each 100,000 letters long, and the resulting curve did not fit the binomial distribution. It was indeed bell-shaped (central limit theorem) and the mean was at the place where the binomial would put it but it was shorter and wider. – Dejan Jelovic May 13 '14 at 20:36