
I have a small question. My real data looks like this:

The y values are seemingly random integers from 0 to 2000. The x values run 1, 2, 3, 4, 5, ... up to 2 million.

Now, my task is to identify significant peaks and remove background noise.

To achieve this, I used two methods:

I created a sliding window of 50 elements. Within each window, I divide y[50] by the mean of y[1] through y[49]. The modified values are then plotted and the significant peaks are recorded.
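A minimal numpy sketch of this first method (the function name, the window of 49 preceding values, and the guard against zero means are my assumptions, purely for illustration):

```python
import numpy as np

def ratio_to_local_mean(y, window=49):
    """Method 1: divide each value by the mean of the preceding `window` values."""
    y = np.asarray(y, dtype=float)
    # A cumulative sum gives every trailing mean in O(n), which matters for 2 million points.
    csum = np.cumsum(np.insert(y, 0, 0.0))
    local_mean = (csum[window:-1] - csum[:-window - 1]) / window  # mean of y[i-window .. i-1]
    scores = np.full(y.shape, np.nan)
    with np.errstate(divide="ignore", invalid="ignore"):
        scores[window:] = y[window:] / local_mean  # non-finite where the local mean is 0
    return scores
```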

I created a sliding window of 50 elements. Within each window, I divide y[50] by y[49], then y[50] by y[48], and so on down to y[50] by y[1], and take the average of these ratios. The modified values are then plotted and the significant peaks are recorded.
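A corresponding sketch of the second method, with the same caveats (the plain loop is for clarity and would want vectorising over 2 million points). Non-finite ratios from zero denominators are dropped before averaging, as mentioned in the comments below:

```python
import numpy as np

def mean_of_pairwise_ratios(y, window=49):
    """Method 2: average y[i]/y[i-1], y[i]/y[i-2], ..., y[i]/y[i-window],
    ignoring non-finite ratios caused by zero denominators."""
    y = np.asarray(y, dtype=float)
    scores = np.full(y.shape, np.nan)
    for i in range(window, len(y)):
        with np.errstate(divide="ignore", invalid="ignore"):
            ratios = y[i] / y[i - window:i]
        finite = ratios[np.isfinite(ratios)]
        if finite.size:
            scores[i] = finite.mean()
    return scores
```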

Which is the best method to use? You can also suggest other methods.

Background info:

The 2 million numbers are the nucleotide positions of an entire genome. For example, if the nucleotide sequence is AUGCAUC... then the 1st position corresponds to A, the 2nd to U, and so on. Each of these positions has a coverage value, which is important for predicting other things. First I calculate these values and see whether the peaks correspond to my results. After visual inspection I have to empirically choose the cut-off value for the peak height, say 100. I am more interested in the method.

The first method does not seem to work out well: if the mean of the previous 49 elements is low, the final value shoots up (for example, a local mean of 0.1 turns even y[50] = 5 into a score of 50).

The second one seems to be fine, but I need a good opinion from a mathematician.

  • This question helps a bit but not completely http://stats.stackexchange.com/questions/36309/how-do-i-find-peaks-in-a-dataset – user2711376 Jun 08 '14 at 23:29
  • Could you tell us what criterion you would use to define whether a peak is "significant" or not? It would also help to provide some context: what is the purpose of this peak identification? What would it cost you to fail to identify a "significant" peak? What would it cost to identify something as "significant" that actually arises out of the background noise alone? – whuber Jun 09 '14 at 13:39
  • I can identify the significant peaks by comparing with my biological results. After completing this statistical step I will choose a number by visually comparing it with the biological results and set an arbitrary value for the height of the peak, which is 90 percent significant – user2711376 Jun 09 '14 at 18:22

1 Answer


The idea here is to develop a baseline and to assess unusual "high" values relative to that baseline. Your first approach assumes a mean model for each set of 50 contiguous values; thus you are assuming a predictive ARIMA model, an AR(50) with equal coefficients (1/50) and no additional constant. The second model/approach is a variety of a random walk model, where the values to be analyzed are ratios of the value at a specific point in time to its predecessors; it would not seem to be appropriate given your assumptions regarding the independence of the values.

If you say that these values are random, then it is very straightforward to look for unusual (high) values relative to the local mean (your first choice). You can employ intervention detection procedures to identify the anomalous points. Since the values are independent of each other, a straightforward +/- k sigma band around the local mean of 50 should work just fine.
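As an illustrative sketch of that rule (my own code, not part of the original answer; the window size, the threshold k, and the pandas rolling helpers are assumptions), one might flag points more than k standard deviations above a trailing local mean:

```python
import numpy as np
import pandas as pd

def flag_high_values(y, window=50, k=3):
    """Flag values more than k sigma above the local mean of the preceding window."""
    s = pd.Series(np.asarray(y, dtype=float))
    local_mean = s.rolling(window).mean().shift(1)  # use statistics of earlier values only
    local_std = s.rolling(window).std().shift(1)
    return ((s - local_mean) > k * local_std).to_numpy()

# e.g. peak_positions = np.flatnonzero(flag_high_values(y))
```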

IrishStat
  • The y values are not random; I only used the word random to describe my data here. In fact, the y values depend on the biological coding AUGCAUC..., so the values are dependent on the neighbouring x values. My second method, which you have described as a variation of a random walk, works really well, but I still do not understand how you came to that conclusion. Some more explanation would be useful. Thank you very much for answering my question at this age of yours. – user2711376 Jun 12 '14 at 11:48
  • A random walk model is y(t) = y(t-1) + a(t). If you divide by y(t-1) you get y(t)/y(t-1) = 1 + a(t)/y(t-1), i.e. y(t)/y(t-1) = 1 + e(t). Your model is very ad hoc and can probably be made more generic, as you are assuming a very specific structure. Time series model identification attempts to identify the structure empirically rather than assuming it. Such a data set as you describe might even pose a challenge to "an old guy" like me, BUT that is what I enjoy: data that suggests the need for improved methodology. – IrishStat Jun 12 '14 at 12:40
  • I understood. My first method, which you described as ARIMA, has a small flaw that I had already mentioned in the question: if the mean of the previous 49 elements is low (say many zeros and a small number like 1), then the final value shoots up, because my final value is y[50]/mean. One more issue is that I am trying to publish this data in a journal, so I am not sure whether my ad hoc approach (the 2nd method), as you have called it, will be accepted. Also, I take the mean of y(t)/y(t-1), y(t)/y(t-2), and so on, and I delete non-finite values, which arise from 0/0 or plain division by zero. Thank you – user2711376 Jun 12 '14 at 14:42
  • Transparency requires validation of assumptions. I am afraid that you will need to step up your assumption checking. Your choice of the word random is an indication of your sophistication and I would suggest a more rigorous approach to model formulation. Perhaps you need to consult a good time series statistician in your community and get their advice. – IrishStat Jun 12 '14 at 15:28