Questions tagged [string-count]

Count the number of occurrences of a specified string within a random text.

Statistical problems involving "string-counts" occur when we count the number of occurrences of a specified string (a vector of symbols) within a text (another vector of symbols). Any distribution for an underlying random text induces a corresponding probability distribution for the string-count, and the properties of this distribution are often of interest in statistical problems. Probabilties pertaining to string-counts are of interest in the analysis of DNA code, and in other problems where we wish to determine whether a given string has occurred more than would be expected "at random".

The distribution of the string-count for a random text is closely related to a number of mathematical and statistical subjects, including Markov chains, deterministic finite automata (DFAs), probability generating functions, and recursive computation formulae. Statistical models for string-counts may use finite-order Markov chains for the symbols in the text, or they may involve simpler models (e.g., models where symbols in the text are IID random variables).

The string-count tag is suitable for any problem involving analysis of the number of occurrences of a string of symbols within a random text. It is also suitable for related problems where we look at the number of symbols in a text that are needed until the occurrence of a string. Note that the "string-count" is closely related to "runs" of symbols in a text.

15 questions

votes

5 answers

Time taken to hit a pattern of heads and tails in a series of coin-tosses

Inspired by Peter Donnelly's talk at TED, in which he discusses how long it would take for a certain pattern to appear in a series of coin tosses, I created the following script in R. Given two patterns 'hth' and 'htt', it calculates how long it…

r probability stochastic-processes string-count

asked Jun 21 '11 at 17:22

lafrasu

votes

4 answers

Probability of finding a particular sequence of base pairs

Thinking about probability always makes me realize how bad I am at counting... Consider a sequence of $n$ base letters $A,\; T, \; C, \text{ and } G$, each equally likely to appear. What is the probability that this sequence contains a particular…

probability combinatorics string-count

asked Apr 23 '12 at 21:59

Charlie

13,124
5
38
68

votes

4 answers

Check if a character string is not random

Background Let's say we have an alphabet of A,B, C, D, then we look through some data and find a "word" which is DDDDDDDDCDDDDDD the chance of finding this random seems low to me whereas finding BABDCABCDACDBACD seems less random. Question How…

text-mining randomness string-count

asked Oct 10 '18 at 12:35

CodeNoob

votes

2 answers

Probability of letters occurring in order in a string

Suppose we have an alphabet containing $m+1$ symbols, $\{a, b, c, d, e,..., \$\}$, where $p = \Pr(a) = \Pr(b) =\cdots$, and $\Pr(\$) = 1 - (\Pr(a)+\Pr(b)+\cdots)=1-mp$. For a random string of length $n$, what is the probability that the letters ${a,…

probability combinatorics string-count

asked Jun 04 '15 at 20:57

Andrew W

votes

2 answers

Stuck on Penney's game question

Suppose we have a game that assumes we have a fair coin. Two people who are playing can pick two patterns of coin flips. The players then flip a large coin a number of times and record each of the individual flips. After the flips, they pay…

probability string-count

asked Sep 15 '21 at 21:08

Priana

votes

1 answer

How many times will a 12 letter word appear in a string of 1 quintillion random letters?

Let's say we think of a specific $12$ letter word. How many times on average will that particular word appear in a string of $1$ quintillion ($10^{18}$) completely random letters (i.e., uniform/equal probability for each letter A-Z at each…

probability combinatorics string-count

asked May 20 '16 at 09:31

Kelvin

1,051
9
18

votes

1 answer

What is a good way of testing for a relationship between two count variables?

I have counts of occurrences of two types of words (A and B) in several texts. What I would like to test is whether the frequencies of occurrence of both types of words across texts is 'correlated'. However, using Pearson's correlation is probably…

correlation count-data poisson-regression string-count

asked Jul 11 '14 at 15:32

Pavel

votes

1 answer

What is the probability for an N-char string to appear in an M-length random string?

Link shortening service bit.ly allows you to, as you might expect, shorten URLs. URLs get shortened using a 7-character string. The alphabet of this string consists of a-z, A-Z and 0-9. Today, Dutch police have used bit.ly for a tweet about the…

probability combinatorics string-count

asked Oct 12 '17 at 17:00

Laurens Jansma

votes

1 answer

What is the expected number of occurrences of a 9-mer in 500 random DNA strings, each of length 1000?

I have started learning bioinformatics. There are some matter of finding expected value. But I think I am very weak in calculating such types of things. As expected value is related to statistics, its explanation is skipped in bioinformatics. So, I…

self-study expected-value string-count

asked Aug 19 '16 at 18:08

Enamul Hassan

votes

0 answers

calculating the statistical significance of overrepresentation of a substring in a string

I have a sequence of characters in a long string like this 'ATCGCGCGCGATCGACGCGTACGTCGGATCTA.....' And I know that for example the substring 'ATCG' has been repeated X times in this string, How could I statistically compute if this number is…

chi-squared-test binomial-distribution string-count

asked Feb 03 '15 at 08:10

user3015703

vote

0 answers

What is the probability of regex pattern occurrence in a string of length n?

I have a string of length n, composed of 20 characters of equal probability. What is chance of occurrence of a regular expression pattern, like 'WP[^WFHY]{5}W' by chance? In case you are not familiar with python, [^WFHY]{5} means any 5 characters…

statistical-significance biostatistics bioinformatics string-count

asked Oct 04 '19 at 14:42

Ahmad Khalifa

vote

2 answers

Probability of substring inside string of coin-tosses

Let's say we flip a coin $N=100$ times and get "HTHTHTHHHTHHHHTTHTT...HHT". What is the probability that se letter combination "HHHHH" occurs at least once in this string of $100$ letters? If not a precise formula then the best approximation for $N…

combinatorics string-count

asked Aug 16 '18 at 09:45

user217790

vote

0 answers

Approximate string matching with uncertainty

Background: I have an interesting little problem where I need some sort of metric to compare strings (or list of numbers of whatever), with the added caveat that the base/ground string (the thing which I am comparing against) is somewhat (but not…

entropy information-theory mutual-information string-count

asked Mar 21 '17 at 11:56

Astrid

vote

1 answer

Given a string of 10 digits what is the chance of a random 10 digit string matching with an edit distance of 2

Given a string of 10 random digits in the range 0-9 what is the probability of another random 10 digit string (in the same range) matching where a match can have an edit distance of 2. By edit distance I mean that a string could be said to match if…

probability string-count

asked Sep 06 '16 at 16:02

HAL

votes

0 answers

Combinations of substring occurrence in string

Say I have a binary sequence of length N (e.g. HTHHTHTTTTTTH). How do I find the number of ways to get a NON-overlapping substring of length K (e.g. HHTH), where 1<=K<=N? It's possible you can multiple repeats of the substring if it's short enough…

probability categorical-data binary-data combinatorics string-count

asked Nov 08 '15 at 21:30

user284993