I'm working with biological sequence data in which each position in the sequence has an associated continuous value. I'm ignoring the sequence content, so the data are very similar to a time series with equally spaced measurements at discrete time points. I would like to detect whether high values tend to cluster together (occur in runs), so I applied the Wald-Wolfowitz runs test for non-random placement of values >1.
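
For concreteness, the test described above can be sketched roughly as follows. This is a minimal illustration using the normal approximation, not necessarily the exact implementation I used; the function name `runs_test` and the default `threshold=1.0` are just illustrative choices.

```python
import math

def runs_test(values, threshold=1.0):
    """Wald-Wolfowitz runs test on values binarised at `threshold`.

    Returns (number of runs, z statistic, one-sided p-value for clustering).
    Uses the large-sample normal approximation, so it is only a rough guide
    for short sequences.
    """
    # Binarise: True where the value exceeds the threshold.
    bits = [v > threshold for v in values]
    n1 = sum(bits)              # count of values above the threshold
    n2 = len(bits) - n1         # count of values at or below it
    if n1 == 0 or n2 == 0:
        raise ValueError("both classes must be present")

    # A new run starts wherever two neighbouring labels differ.
    runs = 1 + sum(a != b for a, b in zip(bits, bits[1:]))

    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    var = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if var <= 0:
        raise ValueError("sequence too short for the normal approximation")

    z = (runs - mean) / math.sqrt(var)
    # Clustering means fewer runs than expected, i.e. the lower tail of z.
    p_clustered = 0.5 * math.erfc(-z / math.sqrt(2))
    return runs, z, p_clustered

# Ten high values followed by ten low values: only two runs.
print(runs_test([2.0] * 10 + [0.0] * 10))
```

A small `p_clustered` indicates fewer runs than expected under random placement, which is the clustering signal I'm after.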

There are some issues with that approach:

  • Wald-Wolfowitz works on binary data, so I have to binarise my continuous values (everything larger than 1 becomes 1 and the rest becomes 0). Ideally I would also like to detect features such as runs of similar values (say, 10 values of 0.5 in a row). I imagine there are methods that operate on continuous values (e.g. based on autocorrelation), but I couldn't find any.

  • While I get a measure of clustering (the test p-value), I don't know which parts are actually clustered or how many clusters there are.

  • I would also like to extend this approach to 3D (mapping of sites on the protein structure), but the test doesn't support multiple dimensions either.
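
To illustrate the kind of autocorrelation-based check on continuous values I have in mind in the first point, here is a rough sketch: a permutation test on the lag-1 autocorrelation. It is an assumption on my part that this is a sensible statistic here, and the names `lag1_autocorr` and `permutation_pvalue` are hypothetical. Note that it addresses only the binarisation issue: it still gives a single global p-value and does not say where the clusters are or extend to 3D.

```python
import random

def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a sequence of continuous values."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        raise ValueError("constant sequence has no defined autocorrelation")
    # Products of deviations for each pair of neighbouring positions.
    return sum((a - m) * (b - m) for a, b in zip(x, x[1:])) / denom

def permutation_pvalue(x, n_perm=2000, seed=0):
    """One-sided permutation p-value for positive lag-1 autocorrelation.

    Shuffling destroys the ordering while keeping the set of values,
    so it gives a null distribution for 'no runs of similar values'.
    """
    rng = random.Random(seed)
    observed = lag1_autocorr(x)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if lag1_autocorr(shuffled) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Three blocks of similar values, including a run of ten 0.5s.
values = [0.5] * 10 + [2.0] * 10 + [0.1] * 10
print(permutation_pvalue(values))
```

Unlike the runs test, this picks up a run of ten 0.5s even though none of those values exceeds 1.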

Are there more sophisticated statistical approaches that I could apply?

kjetil b halvorsen
Greg Slodkowicz
  • Compare http://stats.stackexchange.com/questions/67571/how-can-i-group-numerical-data-into-naturally-forming-brackets-e-g-income – Nick Cox Aug 22 '13 at 13:17
  • Thanks, but (as I see it) the thread you linked to addresses a slightly different question. My data points are in sequence, whereas in the other case it's just a 'bag' of unordered values that are then clustered. – Greg Slodkowicz Aug 28 '13 at 13:26
  • Absolutely not so. The posting and the references spell out that time, space and other sequences are all grist for the mill. Fisher's original example was a time series. Also, where the sequence comes from is immaterial to how you cluster it. What is excluded is clustering on two dimensions, for which you will certainly need quite different methods. – Nick Cox Aug 28 '13 at 16:25
  • Perhaps I'm missing the connection here. I didn't say that it matters where they come from, just that I believe the structure of my data is different. The income data in the example you linked can be reordered (hence my 'bag' metaphor), after which the presence of 'subdivisions' is assessed. My sampling points are fixed (one continuous value for each integer in some range) and it doesn't make sense to reorder them. One way to phrase my problem would be that I'm interested in detecting the presence of runs of similar values. – Greg Slodkowicz Aug 28 '13 at 17:17
  • Detecting runs of similar values is largely what the technique is all about. The references given are all very clear. – Nick Cox Aug 28 '13 at 17:22
  • Perhaps the references are relevant, but the examples given in the thread are not, as far as I can tell. – Greg Slodkowicz Aug 28 '13 at 17:28
  • I don't know where the negativity comes from here. The initial comment was just a suggestion to "compare". The example in my posting was necessarily geared towards the question it answered, but my posting explained at length that there were other applications and there are references for further reading. Sorry, but we all ration our time, and I don't at present have time to rewrite an answer to another question to spell out its implications for yours. – Nick Cox Aug 28 '13 at 17:36
  • I'm not being intentionally negative. I was just trying to avoid the question being marked as resolved when I believe that the methods you're referencing don't apply, at least not directly. – Greg Slodkowicz Aug 28 '13 at 17:45
  • @NickCox Having thought about and researched this some more, I believe you're misunderstanding the problem. – Greg Slodkowicz Apr 18 '14 at 19:29
  • That seems singularly useless as a comment, either to me personally or to anyone else, unless you explain what's wrong and what's right. – Nick Cox Apr 18 '14 at 22:15

0 Answers