4

Let's say we think of a specific $12$ letter word. How many times on average will that particular word appear in a string of $1$ quintillion ($10^{18}$) completely random letters (i.e., uniform/equal probability for each letter A-Z at each position, with no preference for vowels or any other letter)?

Also, what will the actual frequency distribution look like? What would be the formula for that PDF (probability density function) distribution?

For simplicity, we should consider that each incidence is non-overlapping, as most words do not start with the sam letter(s) they end with.

Ben
  • 91,027
  • 3
  • 150
  • 376
Kelvin
  • 1,051
  • 9
  • 18
  • Which distribution would you assume on those $10^{21}$ random letters? – Repmat May 20 '16 at 09:58
  • Completely random, with uniform/equal probability for each letter at each position. No preference for vowels or any other letter. – Kelvin May 20 '16 at 10:12
  • 1
    I think you need to define a "word" a little more precisely: is "aaa" a three-letter word or three one-letter words? –  May 20 '16 at 11:17
  • What is a "PDF distribution"? – whuber May 20 '16 at 12:51
  • 1
    @Bey: do I really need to explain what a "12-letter word" is??? – Kelvin May 20 '16 at 15:20
  • 2
    Kelvin, to appreciate @Bey's comment, consider a smaller version of your question where the word has just two letters and the string is just ten letters long. If the word is "aa" and the string happens to be "baaacqxdia", then does "aa" appear once or twice? Your answer matters because (a) it affects the solution and (b) it affects *how* one finds a solution. BTW, in the US one quintillion is $10^{18}$, not $10^{21}$ (and it's $10^{30}$ in the UK). Although clearly the solution method won't depend on the exact value of this number, you ought to remove the inconsistency. – whuber May 20 '16 at 15:38
  • I very much doubt it will matter as we are talking about 12 letters out of 1 quintillion, but let's say that the word is a real one in English, i.e., not "aaaaaaaaaaaa". – Kelvin May 20 '16 at 16:45
  • 2
    Consider the real English word "tot". Does "tot" appear in the string "abtwetototnxwtot" twice or three times? –  May 20 '16 at 16:54
  • For simplicity, we should consider that each incidence is non-overlapping, as most words do not start with the sam letter(s) they end with. – Kelvin May 20 '16 at 19:45
  • 2
    FWIW, a word does not have to start and end with the same letter to create overlaps: it is only necessary that some *suffix* of the word match some *prefix,* as in "ionization" or "tergiversater." – whuber May 20 '16 at 20:16
  • Yes, that's why I said "letter(s)" with an optional plural "s". Meanwhile it would be good to see an answer with less niggling over largely irrelevant definitions and details. At this rate I could have written down a quintillion letters and counted the number. :-/ – Kelvin May 21 '16 at 00:23
  • @Kelvin the prompts in the comments are to make your question more precise so that you receive a correct answer rather then *any* answer. – Tim Nov 14 '16 at 09:34

1 Answers1

1

For any 12 letter group, the odds of being the specific word of interest is $p={\frac{1}{26}}^{12}$.

The odds of finding exactly 1 in $N=10^{21}$ random letters is $(N-12)\cdot p^1 \cdot (1-p)^{N-12}$, as there are N-11 possible starting locations.

The odds of finding exactly 2, assuming the last n letters overlap the first n (n may obviously be 0), is $\frac{(N-11) \cdot (N-23+n)}{2} \cdot p^2 \cdot (1-p)^{N-24+n}$.

In general for z instances of g-letter length word, the discrete pmf is $p(z)=\frac{1}{z!} \cdot \prod_{i=1..z}\left({N-(g-n) \cdot i+1}\right)\cdot p^z \cdot (1-p)^{(N-(z+1) \cdot g+z \cdot n)}$.

Also $p(0)=(1-p)^{(N-g)}$

MikeP
  • 2,117
  • 8
  • 9
  • 4
    Something looks wrong. To check, let's apply your formula to the tractable case of $2$ letters $a$ and $b$, the two-letter word $aa$, and $N=4$. I compute $p=(1/2)^4$ and, for finding exactly one occurrence (not allowing overlaps), I compute $$(N-2)p^1(1-p)^{N-2}=2(1/16)(15/16)^2\approx 0.11.$$ However, exhaustive enumeration gives $$aabb,aaba,baab,abaa,bbaa,$$ each with probability $1/2^4$, for a total chance of $5/16\approx 0.3125$. (It's unclear whether you count instances like $aaab$ as one or two occurrences: they would only add to the chances.) I get an even larger value for $ab$. – whuber May 20 '16 at 19:13
  • 3
    @whuber one thing that's wrong is that he's assuming that the probability of finding a word of interest at each position is independent of every other position. I think we both answered a similar question that appropriately takes that into account? – Neil G May 20 '16 at 19:35
  • @NeilG I searched for such questions when this thread was first posted, but didn't find it. I do remember what you're referring to, though. Do you have a link? – whuber May 20 '16 at 20:04
  • 1
    @whuber, thanks for looking at my answer! I believe that p should be (1/2)^2 for a 2-letter word, which gives us 0.4219, still above 0.3125 (I'll try to track down the issue!) but not so bad as 0.11! In any case, I believe that setting n=0 gets rid of the overlap cases. – MikeP May 20 '16 at 20:20
  • 1
    Sorry--you're right about $p$. I wouldn't expect your solution to be perfect, because it doesn't model the possibility of overlaps correctly, but the *idea* nevertheless ought to lead to an excellent approximation. It would nevertheless be nice to understand the discrepancies that crop up in small examples. – whuber May 20 '16 at 20:23
  • Also, since I am not allowing overlaps (g=0), aaab and baaa would be acceptable as 1 single match, raising the odds to 7/16 = 0.4375 – MikeP May 20 '16 at 20:24
  • The argument at "the odds of finding exactly 2..." is not quite right, due to end effects. Although it's a good approximation, since you seem to be trying to get an exact answer--otherwise why not use a Poisson distribution to obtain a much simpler result?--this is going to be a consideration. – whuber May 20 '16 at 21:29
  • 1
    @whuber Here it is! http://stats.stackexchange.com/questions/202132/probability-of-n-bit-sequence-appearing-at-least-twice-in-m-bit-sequence and this one http://stats.stackexchange.com/questions/21825/probability-over-multiple-blocks-of-events/60814 – Neil G May 20 '16 at 23:12