1

Imagine you have an array of integer numbers. You need to calculate the probability of encountering some integer sub-array in that list.

For example:

the original list: '1284726437594563495834905834095845634853' which is N integers from 0 to 9.

the question is: what is the probability to encounter '456', order is significant, in that big data set?

my considerations: The are N-2 possible sets of 3 numbers in big list. Each has (1/10)^3 variants. With order significant we need one of them, but any place of the pattern would match. So we have

(1/10)^3 * (N-2)

In case order is not significant, there are n! possible permutations for n items. Thus the formula is

(1/10)^3 * 3! * (N-2)

Am I right?

UPDATE: The digits are distributed randomly, i.e. a possibilities of them are equal. The list is some sequence of these random digits. By "encounter" I mean to have at least one '456' sequence in the given data set.

Looks like my considerations were wrong. Here is a corrected version: A probability to meet at least one sub-array '456': A probability not to meet any is (1 - (1/10)^3) ^ (N-2), so probability to meet one is

1 - (1 - (1/10)^3) ^ (N-2)

A most probable number of patterns to meet is

(N-2) * 1/10^3
KutaBeach
  • 111
  • 4
  • This question is unanswerable without information about how the list is constructed and what you mean by "encounter." For instance, are you asking about the frequency of a particular substring within a string? Or are you asking about the probability that a substring will appear in a string that is generated according to some random process (and if so, what process is it)? A closely related question is addressed at http://stats.stackexchange.com/questions/32531. – whuber Dec 20 '13 at 13:39
  • 1
    Closely related: http://stats.stackexchange.com/questions/26988 (a similar question involving binary strings) and http://stats.stackexchange.com/questions/12174 (the same question, perhaps, involving quaternary strings). – whuber Dec 20 '13 at 14:06

1 Answers1

1

No.

First it can easily be seen that the quantity you computed is not a probability, since it goes to $+\infty$ as $N \rightarrow +\infty$. Actually what you computed is the average number of times you'll observe your pattern.

Computing the exact probability of seeing at last one time the pattern is much more difficult, since there are a lot of dependencies involved (for example, the probability to observe or not your pattern at digits 2-4 is heavily dependent on the fact that you observed or not your pattern at digits 1-3)...

It is possible to derive upper or lower bounds on this probability though.

PS: All this is based on the assumption that the digits are uniformly and i.i.d.

Adrien
  • 478
  • 4
  • 9
  • 2
    This version of the question is asked (for the case of $4$ rather than $10$ digits, but it's the same problem) and answered at http://stats.stackexchange.com/questions/26988/probability-of-finding-a-particular-sequence-of-base-pairs. – whuber Dec 20 '13 at 14:03
  • Adrien, I have corrected my answer. Is it still wrong? Why my approach is not working? – KutaBeach Dec 20 '13 at 15:02