3

Link shortening service bit.ly allows you to, as you might expect, shorten URLs. URLs get shortened using a 7-character string. The alphabet of this string consists of a-z, A-Z and 0-9.

Today, Dutch police have used bit.ly for a tweet about the finding of a body of a girl that was missing for two weeks. Unfortunately, the bit.ly string contained the word "Dead": https://twitter.com/PolitieUtrecht/status/918507900452077568.

That got me wondering: what is the probability that this exact 4-character string will appear in a 7-character string (generated with an alphabet of 62 characters)?

Or, more generally, what is the probability that a defined string $\alpha$ with length $S$ appears somewhere in a string $\beta$ with length $M$, with $M$ being a randomly generated string with an alphabet of 62 characters?

At first I thought "7 positions, 62 possibilities" means $62^7$ combinations, but I'm sure that's not right -- that's the possibility for a 7-character string (e.g. the complete string).

What is a proper method for calculating this?

Ben
  • 91,027
  • 3
  • 150
  • 376

1 Answers1

1

Overall, there are $62^M$ different possible strings, because you have $62$ choices for each of the $M$ characters.

How many of these $62^M$ strings contain your prespecified string $\alpha$? Well, for each "hit", we still have $M-S$ characters that we can choose freely, and $\alpha$ can appear in $M-S+1$ different places in the full string. So we have $(M-S+1)\times 62^{M-S}$ "hits".

Dividing, we get a probability of

$$ \frac{(M-S+1)\times 62^{M-S}}{62^M} = \frac{M-S+1}{62^S}.$$

When $M = 7$ and $S = 4$, the probability is 1 in 3,694,084.

(Of course, that doesn't account for the fact that the effect would have been the same if the random string had contained similar words like "killed" or "corpse", or simply a different capitalization of "Dead".)

Kodiologist
  • 19,063
  • 2
  • 36
  • 68
Stephan Kolassa
  • 95,027
  • 13
  • 197
  • 357
  • Let's try this in a simpler situation; say, searching for a string of length $S=2$ in a random string of length $M=3$ where the alphabet has only two characters "0" and "1" rather than $62.$ It looks like your formula would be $\frac{M-S+1}{2^S}=\frac{3-2+1}{2^2}=\frac{1}{2}.$ Apply this, for instance, to the string "00." Only the strings "000", "001", and "100" contain this substring, so the correct answer for the example must be $3/8.$ I believe something is wrong in your analysis and it derives from hidden, not always correct assumptions about the details of the target string $\alpha.$ – whuber Mar 03 '22 at 21:22