4

I am trying to figure out expected number the occurrences of a DNA motif in a sequence.

Let me first show a small example for the base case: Assume we have a DNA sequence 'TACCTG', where in every position only one base can occur. The sequence thus should pop up roughly every 4^6 = 4096 positions in a DNA strand with uniform background distribution (every base has prob. of 0.25 of occurring).

My motif is not a deterministic sequence. In position 1 and 8 of the motif multiple bases can occur. My question is, how can I calculate the occurrence probability of such a motif in a sequence of length N?

The motif looks as follows:

enter image description here

Thanks for the answer!

  • If the first position can be C,G, or T with those probabilities and positions 2-7 are all fixed and the eighth position can be A,C,G, or T, then there are only 12 possible sequences. If the base in positions 1 and 8 are independent, then you can find the probability by multiplying the two probabilities. For example CTACCTGG occurs with probability 0.7 * 0.1=0.07. – John L Mar 25 '21 at 14:17
  • Thanks! If i do this for all 12 seqs. I would find the prob. of these sequences. What I am searching for is the number of hits I would expect with this motif in a random sequence of a certain length N. As an example, if I look at the deterministic part (bases 2-7) and let's say the random dna sequence is of length 5000. Now on average I would expect 1 hit of my deterministic motif in a sequence of length 5000. Because at the flanking regions multiple bases are possible and they occur with a certain probability. I can't use the simple formula to calculate the hit probability in a sequence. – Jan Adelmann Mar 25 '21 at 15:37
  • Could you clarify what "number of hits" means? For instance, are there one, two, or three hits of "CC" in the string "CCCC"? – whuber Mar 25 '21 at 15:46
  • 1
    There would be 3 hits in the string. – Jan Adelmann Mar 25 '21 at 15:48
  • One way of finding a probability of a hit could be by treating pos. 1 and 8. as follows: (P_1(C) + P_1(G) + P_1(T))*P_2(T)*.....*P_7(G)*(P_8(C) + P_8(G) + P_8(T)+P_8(A)) = 0.75*0.25^6*1. But I am not sure if this makes to much sense. – Jan Adelmann Mar 25 '21 at 15:50

1 Answers1

3

It appears you want to model the string (of length $n$) as a random sequence of letters in the $d=4$ letter alphabet $\mathcal A=\{A,C,G,T\}$ and that by "random" you mean the letters appear independently with uniform probabilities $1/d$ each.

The sense of "random motif" seems similar. It is also a random sequence of $m$ independent letters from the same alphabet, but with a twist: the probability distributions may vary from position to position as specified.

The meaning of a "hit" was clarified in a comment, but is still ambiguous. I will make it precise by assuming that a single motif $a_1a_2\cdots a_m$ is randomly generated according to probability distributions $\pi_1,\pi_2,\ldots,\pi_m$ and a string $s_1s_2\cdots s_n$ is independently randomly generated (with uniform probabilities at each position). This makes it possible to compute the chance of a match at any position $i=1,2,\ldots, n-m+1$ as

$$p_i = \prod_{j=1}^{m} \Pr(a_j=s_{i+j-1})$$

For any letter $a\in\mathcal A$ let $\pi_i(a)$ be the chance that the letter in position $i$ of the motif is $a.$ This breaks the event $a_j=s_{i+j-1}$ into the union of $d$ disjoint events according to the value of $a_j,$ giving

$$\Pr(a_j=s_{i+j-1}) = \sum_{a\in\mathcal A} \Pr(a_j=a)\Pr(s_{i+j-1}=a) = \sum_{a\in\mathcal A} \pi_j(a) \frac{1}{d} = \frac{1}{d}$$

because the $\pi_j(a)$ must sum to unity. Therefore

$$p_i = \frac{1}{d^m}.$$

Consequently, the expected number of hits is obtained by summing all these chances over all the positions from $1$ through $n-m+1,$ yielding $(n-m+1)/d^m.$


The chance of observing at least one hit is far more difficult to obtain, because the events "there is not a hit at position $i$" and "there is not a hit at position $j$" are not independent when $|i-j|\lt m.$ See https://stats.stackexchange.com/a/27031/919 for a discussion of the case where the probe string is deterministic. Without a thorough analysis like that for every one of the $d^m$ probe strings, a fully accurate answer seems impossible to obtain.

Although the analysis can be simplified by looking only at patterns of probe strings (two strings have the same "pattern" when a permutation of the alphabet sends one string to the other), the number of patterns grows exponentially with probe string length $m.$ For example, let "x" refer to any letter, "y" any letter different than "x," and so on. Then

  1. There is just one pattern x of length-one probe strings.

  2. There are two patterns xx and xy of length-two probes when $d\ge 2.$

  3. There are five patterns xxx, xxy, xyy, xyx, and xyz of probes when $d\ge 3.$

  4. There are 13 patterns xxxx, xxxy, xxyy, xxyx, xxyz, xyxx, xyxy, xyyy, xyxz, xyzx, xyzy, xyzz, and xyyz when $d\ge 4,$

and so on.

We might, however, hope to approximate the answer by redefining what a "hit" is. Instead of generating a probe string and a target string and testing the probe at all positions of the target, we could generate a target and generate a new random probe for each position. This would make the results at each position independent, and now the (easy) solution is that the chance of a hit (in this new sense) is

$$\begin{aligned} \Pr(\text{at least one hit}) &= 1 - \Pr(\text{no hits}) \\ &= 1 - \prod_{j=1}^{n-m+1}(1 - \Pr(\text{hit at position }j)) \\ &= 1 - (1 - d^{-m})^{n-m+1}\\ &\approx 1 - \exp((n-m+1)d^{-m}). \end{aligned}$$

(The approximation increases in accuracy rapidly as $m$ grows.)

Intuition suggests this value is either very close to the correct one or might even equal it exactly, because the dependence of the hit rates on the patterns ought to average out across all the patterns.

whuber
  • 281,159
  • 54
  • 637
  • 1,101