I have a sequence of characters in a long string like this:

'ATCGCGCGCGATCGACGCGTACGTCGGATCTA.....'

I know that, for example, the substring 'ATCG' is repeated X times in this string. How can I test statistically whether this count differs significantly from what would be expected at random? The expected count of the substring could be calculated from the frequencies of each character, but I am unsure whether I should use a chi-square test, a binomial test, or some other test to assess the significance of the difference between observed and expected. If a chi-square test, how should I calculate the degrees of freedom? If a binomial test, how should I calculate the value of 'n' in the binomial formula?

binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"),
        conf.level = 0.95) 
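
One way to get the random expectation mentioned above, sketched in R under the assumption that characters are independent and drawn with their observed frequencies. The string is only the visible prefix of the example sequence, and the variable names are illustrative rather than part of any package.

```r
# Toy input: only the visible prefix of the example string (the rest is elided).
s <- "ATCGCGCGCGATCGACGCGTACGTCGGATCTA"
motif <- "ATCG"

# Observed character frequencies in the sequence.
chars <- strsplit(s, "")[[1]]
freq <- table(chars) / length(chars)

# Probability of the motif at one fixed start position,
# ASSUMING independent characters with the frequencies above.
motif_chars <- strsplit(motif, "")[[1]]
p <- prod(as.numeric(freq[motif_chars]))

# Number of possible start positions, then the expected count
# when every (possibly overlapping) occurrence is counted.
n <- nchar(s) - nchar(motif) + 1
expected <- n * p
```

Under this model `expected` is the mean count when overlapping occurrences are all counted. The spread of the count around that mean also depends on how the motif overlaps with itself, so the mean alone does not settle which test to use.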

I would appreciate any hints.

Ben
user3015703
  • Actually, you cannot compute the expectation just from the frequencies of each character: it depends on the specific substring and on how you count "repetitions." As a tiny example, the string "AA" appears two times within "ATAAAG". For more about this, please consult the related thread at http://stats.stackexchange.com/questions/26988. Then perhaps you can edit this question to clarify these points. – whuber Feb 03 '15 at 16:31
  • You say, "for example". That implies that you may want to do more than one statistical test (for various strings). If so, how many such tests do you envision? – Joel W. Feb 04 '15 at 19:10
  • @JoelW. Actually, the string is several thousand characters long, and there are dozens of substrings to test for significant overrepresentation. – user3015703 Feb 05 '15 at 03:39
  • @whuber: That thread is not what I am doing here. I am not calculating the probability that the substring exists somewhere in the whole string, but at each location, and I count repetitions starting from all (n-r+1) locations. What I finally used was this: for example, if the ref is 'ATCATCAGAGAGATC' and I am calculating the overrepresentation of 'ATCA', I take (n-r+1) (12 here) as the total number of trials, the observed repetitions as 'x' (2 here), and 1/4^(frequency based prob for each character) as the probability of success in each trial, and then use the binomial test. Am I doing it wrong? – user3015703 Feb 05 '15 at 03:44
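
The calculation described in the last comment can be sketched in R as follows. `count_overlapping` is a hypothetical helper, not a base R function, and the success probability assumes each of the four letters is equally likely (1/4 per character), which is only one reading of that comment. Because neighbouring windows share characters, the trials are not independent, so the binomial p-value here is at best an approximation.

```r
# Hypothetical helper: count occurrences of `motif` in `s`,
# including occurrences that overlap one another.
count_overlapping <- function(s, motif) {
  r <- nchar(motif)
  n_windows <- nchar(s) - r + 1
  if (n_windows < 1) return(0L)
  starts <- seq_len(n_windows)
  sum(substring(s, starts, starts + r - 1) == motif)
}

# The "AA" in "ATAAAG" example from the comments: both overlapping hits count.
count_overlapping("ATAAAG", "AA")

# The reference string and motif from the last comment.
ref <- "ATCATCAGAGAGATC"
motif <- "ATCA"

x <- count_overlapping(ref, motif)    # observed count
n <- nchar(ref) - nchar(motif) + 1    # (n - r + 1) start positions
p <- (1 / 4)^nchar(motif)             # ASSUMPTION: uniform letter probabilities

# One-sided test for overrepresentation, treating the windows as independent trials.
res <- binom.test(x, n, p = p, alternative = "greater")
res$p.value
```

Replacing `p` with a product of observed character frequencies changes only the success probability. The dependence between overlapping windows remains either way, and it is strongest for motifs such as 'ATCA' that can overlap with themselves.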

0 Answers