Questions tagged [string-count]

Count the number of occurrences of a specified string within a random text.

Statistical problems involving "string-counts" occur when we count the number of occurrences of a specified string (a vector of symbols) within a text (another vector of symbols). Any distribution for an underlying random text induces a corresponding probability distribution for the string-count, and the properties of this distribution are often of interest in statistical problems. Probabilties pertaining to string-counts are of interest in the analysis of DNA code, and in other problems where we wish to determine whether a given string has occurred more than would be expected "at random".

The distribution of the string-count for a random text is closely related to a number of mathematical and statistical subjects, including Markov chains, deterministic finite automata (DFAs), probability generating functions, and recursive computation formulae. Statistical models for string-counts may use finite-order Markov chains for the symbols in the text, or they may involve simpler models (e.g., models where symbols in the text are IID random variables).

The string-count tag is suitable for any problem involving analysis of the number of occurrences of a string of symbols within a random text. It is also suitable for related problems where we look at the number of symbols in a text that are needed until the occurrence of a string. Note that the "string-count" is closely related to "runs" of symbols in a text.

15 questions
30
votes
5 answers

Time taken to hit a pattern of heads and tails in a series of coin-tosses

Inspired by Peter Donnelly's talk at TED, in which he discusses how long it would take for a certain pattern to appear in a series of coin tosses, I created the following script in R. Given two patterns 'hth' and 'htt', it calculates how long it…
lafrasu
  • 403
  • 1
  • 4
  • 6
13
votes
4 answers

Probability of finding a particular sequence of base pairs

Thinking about probability always makes me realize how bad I am at counting... Consider a sequence of $n$ base letters $A,\; T, \; C, \text{ and } G$, each equally likely to appear. What is the probability that this sequence contains a particular…
Charlie
  • 13,124
  • 5
  • 38
  • 68
8
votes
4 answers

Check if a character string is not random

Background Let's say we have an alphabet of A,B, C, D, then we look through some data and find a "word" which is DDDDDDDDCDDDDDD the chance of finding this random seems low to me whereas finding BABDCABCDACDBACD seems less random. Question How…
CodeNoob
  • 201
  • 2
  • 7
8
votes
2 answers

Probability of letters occurring in order in a string

Suppose we have an alphabet containing $m+1$ symbols, $\{a, b, c, d, e,..., \$\}$, where $p = \Pr(a) = \Pr(b) =\cdots$, and $\Pr(\$) = 1 - (\Pr(a)+\Pr(b)+\cdots)=1-mp$. For a random string of length $n$, what is the probability that the letters ${a,…
Andrew W
  • 183
  • 5
6
votes
2 answers

Stuck on Penney's game question

Suppose we have a game that assumes we have a fair coin. Two people who are playing can pick two patterns of coin flips. The players then flip a large coin a number of times and record each of the individual flips. After the flips, they pay…
Priana
  • 83
  • 5
4
votes
1 answer

How many times will a 12 letter word appear in a string of 1 quintillion random letters?

Let's say we think of a specific $12$ letter word. How many times on average will that particular word appear in a string of $1$ quintillion ($10^{18}$) completely random letters (i.e., uniform/equal probability for each letter A-Z at each…
Kelvin
  • 1,051
  • 9
  • 18
4
votes
1 answer

What is a good way of testing for a relationship between two count variables?

I have counts of occurrences of two types of words (A and B) in several texts. What I would like to test is whether the frequencies of occurrence of both types of words across texts is 'correlated'. However, using Pearson's correlation is probably…
Pavel
  • 262
  • 1
  • 12
3
votes
1 answer

What is the probability for an N-char string to appear in an M-length random string?

Link shortening service bit.ly allows you to, as you might expect, shorten URLs. URLs get shortened using a 7-character string. The alphabet of this string consists of a-z, A-Z and 0-9. Today, Dutch police have used bit.ly for a tweet about the…
3
votes
1 answer

What is the expected number of occurrences of a 9-mer in 500 random DNA strings, each of length 1000?

I have started learning bioinformatics. There are some matter of finding expected value. But I think I am very weak in calculating such types of things. As expected value is related to statistics, its explanation is skipped in bioinformatics. So, I…
Enamul Hassan
  • 133
  • 1
  • 1
  • 11
2
votes
0 answers

calculating the statistical significance of overrepresentation of a substring in a string

I have a sequence of characters in a long string like this 'ATCGCGCGCGATCGACGCGTACGTCGGATCTA.....' And I know that for example the substring 'ATCG' has been repeated X times in this string, How could I statistically compute if this number is…
1
vote
0 answers

What is the probability of regex pattern occurrence in a string of length n?

I have a string of length n, composed of 20 characters of equal probability. What is chance of occurrence of a regular expression pattern, like 'WP[^WFHY]{5}W' by chance? In case you are not familiar with python, [^WFHY]{5} means any 5 characters…
1
vote
2 answers

Probability of substring inside string of coin-tosses

Let's say we flip a coin $N=100$ times and get "HTHTHTHHHTHHHHTTHTT...HHT". What is the probability that se letter combination "HHHHH" occurs at least once in this string of $100$ letters? If not a precise formula then the best approximation for $N…
user217790
  • 11
  • 1
1
vote
0 answers

Approximate string matching with uncertainty

Background: I have an interesting little problem where I need some sort of metric to compare strings (or list of numbers of whatever), with the added caveat that the base/ground string (the thing which I am comparing against) is somewhat (but not…
1
vote
1 answer

Given a string of 10 digits what is the chance of a random 10 digit string matching with an edit distance of 2

Given a string of 10 random digits in the range 0-9 what is the probability of another random 10 digit string (in the same range) matching where a match can have an edit distance of 2. By edit distance I mean that a string could be said to match if…
HAL
  • 13
  • 2
0
votes
0 answers

Combinations of substring occurrence in string

Say I have a binary sequence of length N (e.g. HTHHTHTTTTTTH). How do I find the number of ways to get a NON-overlapping substring of length K (e.g. HHTH), where 1<=K<=N? It's possible you can multiple repeats of the substring if it's short enough…