I have a string of length n, drawn from an alphabet of 20 characters that are all equally likely. What is the probability that a regular expression pattern such as 'WP[^WFHY]{5}W' occurs in it purely by chance? In case you are not familiar with Python regular expressions, [^WFHY]{5} means any 5 characters that are not W, F, H or Y.
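To make the first question concrete, here is a minimal sketch of one way the calculation could be set up. It assumes independent characters and only handles fixed-length patterns like the one above, written out as one set of allowed characters per position. The names `PATTERN`, `UNIFORM`, `window_probability` and `prob_at_least_one_match` are made up for illustration. The second function tracks which partial matches are still alive after each character, so that overlapping matches (the final W of one match can be the first W of the next) are not counted as independent events.

```python
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
# WP[^WFHY]{5}W written as one set of allowed characters per position.
NOT_WFHY = set(ALPHABET) - set("WFHY")
PATTERN = [{"W"}, {"P"}] + [NOT_WFHY] * 5 + [{"W"}]
# Assumption: all 20 characters are equally likely and independent.
UNIFORM = {c: 1.0 / len(ALPHABET) for c in ALPHABET}


def window_probability(pattern, freqs):
    """Probability that one fixed window of len(pattern) characters matches."""
    p = 1.0
    for allowed in pattern:
        p *= sum(freqs[c] for c in allowed)
    return p


def prob_at_least_one_match(n, pattern=PATTERN, freqs=UNIFORM):
    """Probability that a random string of length n contains the pattern.

    State = bitmask; bit j is set when the last j+1 characters match the
    first j+1 positions of the pattern.
    """
    m = len(pattern)
    full = 1 << (m - 1)
    dist = {0: 1.0}   # probability of each state, given no match so far
    matched = 0.0     # probability that a match has already occurred
    for _ in range(n):
        new_dist = defaultdict(float)
        for mask, pr in dist.items():
            # Extend every live partial match and start a fresh one.
            cand = (mask << 1) | 1
            for c, pc in freqs.items():
                nxt = 0
                for j in range(m):
                    if (cand >> j) & 1 and c in pattern[j]:
                        nxt |= 1 << j
                if nxt & full:
                    matched += pr * pc
                else:
                    new_dist[nxt] += pr * pc
        dist = new_dist
    return matched
```

For example, `prob_at_least_one_match(300)` would give the chance of at least one match in a 300-character string under these assumptions. A rougher shortcut is `1 - (1 - p) ** (n - 7)` with `p = window_probability(PATTERN, UNIFORM)`, which treats the n - 7 possible start positions as independent.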
Furthermore, I have a database of 17000 sequences. How do I calculate the same probability given that the length of the string varies from sequence to sequence? I assume we can't concatenate all the sequences and reuse the first calculation, because a match can't span two sequences.
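A rough sketch of how the database case might be handled without concatenating anything: compute a match probability for each sequence from its own length, then combine the sequences as independent trials. The function names are hypothetical, the list of lengths would come from the real database, and the per-sequence formula is only an approximation because it treats every start position within a sequence as independent.

```python
import math


def approx_match_prob(length, p, m):
    """Approximate chance of at least one match in a sequence.

    p is the probability that one window of m characters matches.
    Assumption: the length - m + 1 start positions are independent.
    """
    positions = max(length - m + 1, 0)
    return 1.0 - (1.0 - p) ** positions


def database_summary(lengths, p, m):
    """Return (expected number of sequences with a match,
    probability that at least one sequence has a match)."""
    probs = [approx_match_prob(length, p, m) for length in lengths]
    expected_hits = sum(probs)
    # Sum logs of the no-match probabilities to avoid underflow
    # when multiplying thousands of factors.
    log_no_hit = sum(math.log1p(-q) for q in probs)
    return expected_hits, -math.expm1(log_no_hit)
```

With the pattern above and equal character probabilities, `p` would be (1/20)^3 * (16/20)^5 = 4.096e-5 and `m` would be 8, so the call would look like `database_summary(lengths, 4.096e-5, 8)` where `lengths` holds the 17000 sequence lengths.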
Lastly, what if the characters do not all occur with equal probability? How do I calculate the frequency of each letter and factor it into the calculation?
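For the unequal-frequency case, one possible approach (a sketch, not necessarily the right model) is to estimate each letter's frequency by counting over the whole database and then multiply the per-position probabilities. This assumes characters are independent of their neighbours. The function names are hypothetical, and characters outside the 20-letter alphabet are simply ignored.

```python
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def letter_frequencies(sequences):
    """Estimate the frequency of each letter by counting over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no characters from the alphabet were found")
    return {c: counts[c] / total for c in ALPHABET}


def window_probability(freqs):
    """Probability that one 8-character window matches WP[^WFHY]{5}W.

    Assumption: each position is drawn independently from freqs.
    """
    not_wfhy = 1.0 - sum(freqs[c] for c in "WFHY")
    return freqs["W"] * freqs["P"] * not_wfhy ** 5 * freqs["W"]
```

The resulting window probability replaces the equal-probability value (1/20)^3 * (16/20)^5 wherever that value is used.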
Characters: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y