
I have a large scatterplot with about 100,000 (x, y) points. The x-coordinates are the integers from 1 to ~100,000; in other words, no two points share an x-coordinate. The y value is roughly constant (around 50-70), but there are key "regions" where it spikes to ~120 or drops to ~20. How would I statistically differentiate these regions?

Clustering? Any other ideas?

As a bonus, if you know R, a reference to a specific method or function would be helpful as well.

  • Have you first considered a simple correlation analysis? There is a plethora of classification methods out there. If you really want to differentiate regions, I would start by googling classification methods and seeing which ones fit what you are looking for. –  Jul 28 '14 at 17:16
  • What do $x$ and $y$ represent? Could this be a time series? – whuber Jul 28 '14 at 17:19
  • Yeah, so I am quite new to statistics. I am a high school sophomore and have only taken AP Statistics (Statistics 101 in college, I think), so I have almost no real experience. What kind of correlation analysis? R-squared? – user3855285 Jul 28 '14 at 17:22
  • @whuber It is actually a genomic data problem. X is genomic location on a chromosome and Y is the number of mutations at that location. – user3855285 Jul 28 '14 at 17:23
  • @whuber Does that clarify the problem well enough? – user3855285 Jul 28 '14 at 17:33
  • Although it is unclear what you mean by "statistically differentiate," it sounds like material on [peak detection in signal processing](http://stats.stackexchange.com/questions/36309) and [change-point methods](http://stats.stackexchange.com/questions/tagged/change-point?sort=votes&pageSize=30) would potentially be relevant and useful. Possibly some related material also appears in [threads referencing "genome"](http://stats.stackexchange.com/search?tab=votes&q=genome). Perhaps, after perusing some of these, you might be able to edit this question to clarify further what you need. – whuber Jul 28 '14 at 20:41
  • What are you trying to achieve? What is the purpose of "differentiating" these regions of higher or lower y-value? – Glen_b Jul 29 '14 at 00:26

1 Answer


I would suggest starting with a kernel-smoothing method.

A kernel smoother produces what is essentially a weighted moving average of y at each x: neighboring values of y contribute to the average, with weights that shrink as their distance from the target x grows.

You could then set a threshold for identifying regions of x, based on the height and width of this moving average. For example, flag a region when the kernel-smoothed value of y stays above 75 for at least 5 consecutive units of x.
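The smoothing-then-thresholding idea can be sketched in a few lines. This is only an illustration under stated assumptions, not a definitive implementation: the data are synthetic, the Gaussian kernel, the bandwidth of 3, and the lower cutoff of 45 for dips are arbitrary choices, and the helper names `kernel_smooth` and `find_regions` are hypothetical. Only the "above 75 for at least 5 consecutive units" rule comes from the text above.

```python
import math
import random


def kernel_smooth(y, bandwidth=3.0):
    """Gaussian-kernel weighted average of y, assuming x = 0, 1, 2, ... (evenly spaced)."""
    half = int(math.ceil(3 * bandwidth))  # ignore neighbors beyond 3 bandwidths
    weights = [math.exp(-0.5 * (d / bandwidth) ** 2) for d in range(-half, half + 1)]
    n = len(y)
    smoothed = []
    for i in range(n):
        num = den = 0.0
        for k, w in enumerate(weights):
            j = i + k - half
            if 0 <= j < n:  # renormalize near the edges
                num += w * y[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def find_regions(values, threshold, min_width, above=True):
    """Return inclusive (start, end) index pairs of runs of at least
    min_width consecutive values beyond the threshold."""
    regions = []
    start = None
    for i, v in enumerate(values):
        hit = v > threshold if above else v < threshold
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            if i - start >= min_width:
                regions.append((start, i - 1))
            start = None
    if start is not None and len(values) - start >= min_width:
        regions.append((start, len(values) - 1))
    return regions


# Synthetic stand-in for the real data (assumption): baseline near 60,
# one spike to about 120 and one dip to about 20.
random.seed(0)
y = [60 + random.uniform(-5, 5) for _ in range(1000)]
for i in range(300, 350):
    y[i] = 120 + random.uniform(-5, 5)
for i in range(700, 740):
    y[i] = 20 + random.uniform(-5, 5)

smoothed = kernel_smooth(y, bandwidth=3.0)
peaks = find_regions(smoothed, threshold=75, min_width=5)
dips = find_regions(smoothed, threshold=45, min_width=5, above=False)
print("spike regions:", peaks)
print("dip regions:", dips)
```

The bandwidth controls how much a short, isolated outlier is damped before thresholding, so it and the minimum width would need tuning on the real data. Since the question asks about R: `ksmooth()` in the stats package performs kernel regression smoothing, and `rle()` in base R gives run lengths of a logical vector, so the two could likely be combined in the same way.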

ryan