0

Say I have a binary sequence of length N (e.g. HTHHTHTTTTTTH). How do I find the number of ways to get a NON-overlapping substring of length K (e.g. HHTH), where 1<=K<=N?

It's possible you can multiple repeats of the substring if it's short enough compared to the full string length N. Call the number of repeats M.

One obvious piece of information is the max number of possible repeats (an upper bound on M), is floor(N/K), since e.g. for N=7, K=3, you can only have up to 2 simultaneous occurrences of the substring (2*3 <= 7).

So what I'm looking for is f(N,K,M) such that f gives me the number of ways I can get my length K substring repeated M times in my length N string.

I imagine someone proficient in bioinformatics should be able to solve this pretty easily, as the analogous question for a DNA sequence is a superset of this problem (DNA has A,C,G,T bases instead of just heads/tails, and the basepairs have different frequencies) so it's a more general problem. In fact, if someone wanted to simply derive the analogous results for DNA sequencing, where we have a categorical distribution, that would be acceptable too because it is more general (see e.g. https://en.wikipedia.org/wiki/Categorical_distribution)

Ben
  • 91,027
  • 3
  • 150
  • 376
user284993
  • 21
  • 1
  • What exactly is a "non-overlapping substring"? "Overlapping" would seem to be a relationship among *multiple* substrings, so its meaning when applied to a single substring isn't evident. Generally only can find $N-K+1$ substrings of length $K$ and $\text{int}(N/K)$ non-overlapping substrings of length $K$. Would your question be related to any of [the others about substrings](http://stats.stackexchange.com/search?q=substring)? – whuber Nov 08 '15 at 22:17
  • I'm using non-overlapping substring to mean, e.g. in the full sequence HTHTHTH, with substring HTH, you could count HTH T HTH as containing M=2 repeats of the substring, but you would not count this full sequence as having 3 occurrences of the substring because that would require using the same H's (indices 3 and 5) for more than one substring. From searching for "substring," I believe this question is different enough from all previous questions that they do not contain the answer to this post. – user284993 Nov 08 '15 at 23:33
  • Considering the string `HTHTHTHTHTHTH`, one may view the substrings `HTH` starting at characters 1, 5, and 9 as non-overlapping copies of `HTH` or one may view the substrings `HTH` starting at characters 3, 7, and 11 as non-overlapping copies. Thus there are two distinct ways to find `HTH` three times within this string. Is this value of 2 what you mean by the "number of ways I can get my length 3 substring repeated 3 times in my length 13 string"? – whuber Nov 08 '15 at 23:38
  • What I mean is given full sequence length N (don't know exact sequence), and a length K (don't know exact substring), for a given M = 1,...,floor(N/K), what is # of potential ways to have M repeats. So as a function of M, how many repeats are potentially possible. You're right that the particular sequence and substring determine how many repeats actually occur. But I'm looking for the potential number of appearances, not actual (which is problem dependent). The wrinkle is in not double counting the same situation (e.g. substring=HTH, then HTHHTH and HTHxxx are redundant). – user284993 Nov 09 '15 at 00:28
  • This latest version of your question seems to be answered by my first comment: if you don't know what the particular string and substring are, then there isn't much more to say about the number of "potential" ways to have repeats except to say it cannot exceed $N/K$. I think you might need to describe what you mean by a "way." – whuber Nov 09 '15 at 00:31

0 Answers0