Calculating the statistical significance of overrepresentation of a substring in a string

Question

I have a sequence of characters in a long string like this:

'ATCGCGCGCGATCGACGCGTACGTCGGATCTA.....'

I know that, for example, the substring 'ATCG' is repeated X times in this string. How can I test statistically whether this count differs significantly from what would be expected at random? The random expectation for the substring could be calculated from the frequencies of each character, but I am unsure whether I should use a chi-square test, a binomial test, or some other test to assess the significance of the difference between the observed and expected counts. If a chi-square test, how should I calculate the degrees of freedom? If a binomial test, how should I choose the value of 'n' in the binomial formula?

binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"),
        conf.level = 0.95)

I would appreciate any hints.

Actually, you cannot compute the expectation just from the frequencies of each character: it depends on the specific substring and on how you count "repetitions." As a tiny example, the string "AA" appears two times within "ATAAAG". For more about this, please consult the related thread at http://stats.stackexchange.com/questions/26988. Then perhaps you can edit this question to clarify these points. — whuber, Feb 03 '15 at 16:31
You say, "for example". That implies that you may want to do more than one statistical test (for various strings). If so, how many such tests do you envision? — Joel W., Feb 04 '15 at 19:10
@JoelW. Actually, the string is several thousand characters long, and there are dozens of substrings whose overrepresentation I want to test. — user3015703, Feb 05 '15 at 03:39
@whuber: That thread is not what I am doing here. I am not calculating the probability that the substring exists somewhere in the whole string, but at each location, and I count repetitions starting from every location, which gives (n-r+1) locations. What I finally used was this: for example, if the ref is 'ATCATCAGAGAGATC' and I am calculating the overrepresentation of 'ATCA', I took (n-r+1) (12 here) as the total number of trials, the observed repetitions as 'x' (2 here), and 1/4^(frequency-based prob. for each character) as the probability of success in each trial, and used the binomial test. Am I doing it wrong? — user3015703, Feb 05 '15 at 03:44
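
The procedure described in the last comment can be sketched in R roughly as follows. This is only an illustration of that approach, not a recommended answer. The helper name `count_overlapping` is hypothetical, and estimating each character's probability from the reference string itself is an assumption about what "frequency-based" means here. Overlapping windows are not independent trials, so the binomial p-value should be treated as approximate at best.

```r
# Count occurrences of `pattern` in `text`, allowing overlaps, by matching a
# zero-width lookahead at every start position.
# Assumes `pattern` contains no regex metacharacters (true for A/C/G/T).
count_overlapping <- function(text, pattern) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

ref     <- "ATCATCAGAGAGATC"
pattern <- "ATCA"

n      <- nchar(ref)          # length of the reference string
r      <- nchar(pattern)      # length of the substring
trials <- n - r + 1           # number of possible start positions
x      <- count_overlapping(ref, pattern)

# Assumption: characters are independent, with probabilities estimated
# from their observed frequencies in `ref`.
chars <- strsplit(ref, "")[[1]]
freqs <- table(chars) / n
p     <- prod(as.numeric(freqs[strsplit(pattern, "")[[1]]]))

# One-sided test for overrepresentation.
res <- binom.test(x, trials, p = p, alternative = "greater")
res$p.value
```

With the example from the comment, `trials` is 12 and `x` is 2, matching the numbers given there. When dozens of substrings are tested, the resulting p-values would also need a multiple-testing adjustment, for instance with `p.adjust`.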
