
At the bottom of this question is the data of three time series in CSV format. All are of the same length and all contain measurements of the same event "A", but each time series uses a different method of measurement.

I am looking for a live method to find around 33 upward and downward phases in total (peaks + troughs) for event "A" as the data is coming in. I have tried to implement FFT, Kalman and median-filter approaches, but have so far been unsuccessful in getting close to 33 peaks and troughs (I get either around 5-10 or 60-100). In theory the series should be cyclical like a sine curve but with varying periods, i.e. up -> down -> up -> down... etc.

In the example plot below of the data I have annotated approximately where the peaks (even numbers) and troughs (odd numbers) would be identified:

[Example plot: the series with peaks (even numbers) and troughs (odd numbers) annotated]

I am not looking for exactly 33 peaks/troughs, but for a consistent method that attempts to identify peaks/troughs as the data is coming in (i.e. live). I would much appreciate answers that include code, for clarity of method.

Edit 1, in response to a comment below: some lag in the method is expected and is not a problem.

The data (in CSV-file format):

  • MA data measurement 1
  • MA data measurement 2
  • MA data measurement 3

I have asked a similar question previously, but it was not very well defined. I have tried to be more specific this time.

litmus
  • My solution (with code) at https://stats.stackexchange.com/questions/36309/how-do-i-find-peaks-in-a-dataset/36326#36326 works in this case, too, because it employs only local (windowed) procedures. The tuning has to be performed beforehand, perhaps based on previous datasets. The `loess` smooth used there has to be replaced by a rolling version (or, more simply, by a weighted running mean). There's no way, though, you can guarantee that at the end exactly 33 peaks will be detected: that would require accurately predicting the future data. (A generic sketch of this smooth-then-find-windowed-extrema idea appears after these comments.) – whuber Jan 09 '22 at 19:35
  • I have a meta-answer: Generate some sample data (eg draw 33 points with the right up-down alternation, connect them with a cubic spline, add some noise and maybe some lags, and repeat this thousands of times), and then try lots of different techniques (eg whuber’s algorithm with different parameters, or variants of what you’ve tried before) and see which techniques work best. – Matt F. Jan 10 '22 at 14:07
  • @whuber I have tried to implement your code but there is something I am doing wrong since it does not converge to find any peaks, but thanks for the suggestion. Admittedly this is the first time I have used R, but still surprised I could not get the peaks detected. – litmus Jan 23 '22 at 15:51
  • @MattF. thanks for the suggestion, have tried something similar with FFT smoothing but still get a lag that is too big for it to be 'live'. Was really hoping someone had the knowledge to show some working method. Maybe the question is much much harder than I thought.. – litmus Jan 23 '22 at 15:55
  • This does seem like a very hard problem. However, if I were you, I would try Bayesian Changepoint Detection. There are many packages for it in R, I use one called bcp . You can basically "train" it on the first x datapoints, then at each successive point, calculate probability of a changepoint within last y periods (depending on how "live" you need it to be). Then, if prob>threshold, you call it a changepoint. Maybe it won't work on its own, maybe you can feed this data into some other model to help identify the peaks/troughs. I would recommend playing around with BCP a bit, though. Cool algo – Vladimir Belik Jan 25 '22 at 06:35
  • @VladimirBelik thanks for the suggestion. I have actually tried something very similar to that with HMMs. It does work somewhat well but becomes quite computationally heavy. If you're able to get any results with the BCP package please let us know on SE! – litmus Jan 25 '22 at 14:15
  • Ah, makes sense. What is your acceptable lag for these peak/trough identifications (roughly)? Additionally, could you share any info about your Kalman filter attempt? (I've just been curious myself about using Kalman filters for problems like this). – Vladimir Belik Jan 25 '22 at 16:04
  • @VladimirBelik Good question! Of course some lag is acceptable (don't have any specific numerical target, because a lag is to be expected), but as always I hope to minimize lag while maximizing the signal. Sure, regarding the Kalman, I got into trouble when I had to create the model that would link the 3 measurement series. I assumed a linear relationship, but the result was barely better than a moving average of the 3 series, so it did not feel worth it (but there is a possibility I did something wrong since all this is very new to me). – litmus Jan 25 '22 at 19:17
  • Does there need to be a link between the three? Or could you have separate peak/trough detection for the three series? Also, if not a specific number, could you give an order of magnitude for acceptable lag? 100 periods? 1000 periods? – Vladimir Belik Jan 25 '22 at 19:18
  • @VladimirBelik Nope, no real need, but I hoped it would help increase the signal (sure, separate peak detection for each series would also be an option, but since all 3 are measuring the same thing I hoped that Kalman would be able to remove the noise better with 3 working together). Anything between 100 and 1000 would be acceptable. If the lag is 1000 but it works very well then that would be great (and better than a 100-lag method that is wrong very often). Even if it's a bit higher than 1000 it would also be OK if the method is good. – litmus Jan 25 '22 at 19:25
  • I applied the solution at https://stats.stackexchange.com/questions/36309, **which fully answers the same question,** and it worked beautifully for your first dataset. Using a span of $0.05$ it finds 27 peaks. Applying it to the *negatives* of the values, it finds 27 troughs. With a span of $1/37$ it finds 39 peaks and 40 troughs. The dataset is fairly large (66K+ points), so the calculation takes a couple of seconds. – whuber Jan 26 '22 at 15:59
  • @whuber really glad to hear this! Would you mind giving a short answer where you show this result for us? I am now fairly sure that I must be doing something wrong when I try to implement your code... With span = 0.05 or 1/37, which size W did you use? – litmus Jan 30 '22 at 13:11
  • I used the values of `w` in the example code. The only thing I did was to replace its simulated data with your actual data. – whuber Jan 30 '22 at 15:35
  • @whuber OK, thanks for the confirmation and for your input! Like you said, it worked well and is surprisingly consistent when changing variables. However, the lag is too long for it to work "live" as peak detection. But it is still nice for classifying past data (I got everything working after fixing an OS setting). – litmus Feb 01 '22 at 13:04
  • You can always use a shorter window. – whuber Feb 01 '22 at 14:20
  • @whuber That is true. But, for example, even changing W from 50 to 30 did not change the sensitivity significantly. I assume that this is the downside of the "consistency" I commented on before. – litmus Feb 02 '22 at 09:01
  • A change from 50 to 30 is almost inconsequential. If you need huge sensitivity, use a window between 1 and 3. – whuber Feb 02 '22 at 15:17
  • @whuber thanks for pointing that out! I have now tested a much wider range and W=1, s=0.03 is better and reacts with less lag. Will have to figure out a way to optimize for minimum number of extra points against minimum size of lag. – litmus Feb 03 '22 at 07:36
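
For reference, here is a generic sketch of the "smooth the series, then flag windowed extrema" idea described in whuber's first comment above, using a weighted running mean in place of the loess smooth (as that comment suggests). It is not the code from the linked answer; the filename, kernel length k and window half-width w are illustrative assumptions only.

# Sketch only: rolling smooth + windowed-extremum detection in the spirit of the
# approach discussed above (NOT the code from the linked answer). The filename,
# kernel length k and window half-width w are illustrative assumptions.
series <- read.csv("MA_meas1.csv", header = TRUE)  # assumed local copy of one dataset
x <- series[[1]]                                   # the single measurement column

k <- 301                                     # length of the (odd) smoothing kernel
kernel <- dnorm(seq(-3, 3, length.out = k))
kernel <- kernel / sum(kernel)               # weighted running mean (Gaussian weights)
smoothed <- stats::filter(x, kernel, sides = 2)

w <- 50                                      # half-width of the local-extremum window
peaks <- c(); troughs <- c()
for (i in (w + 1):(length(smoothed) - w)) {
  window <- smoothed[(i - w):(i + w)]
  if (!anyNA(window)) {
    if (smoothed[i] == max(window)) peaks   <- c(peaks, i)
    if (smoothed[i] == min(window)) troughs <- c(troughs, i)
  }
}

# In a live setting both the smoothing and the extremum test can only look
# backwards, so a detection arrives roughly k/2 + w points after the true
# peak/trough.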

1 Answer


Although I suggested Bayesian Changepoint Detection in the comments, I tried it out and had no luck. It's the wrong method for this job.

However, I think there's a pretty straightforward way to accomplish this, inspired by methods used in finance for a similar objective.

Before I go into my solution, I would first like to highlight a central difficulty that makes it really quite hard to get the exact results you're looking for.

Your choice of peaks/troughs is arbitrary and somewhat fuzzy. This is possibly unavoidable, but it presents a big challenge.

Take a look at 16 and 17. You chose those two as a peak and a trough, even though the difference between them is smaller than several oscillations earlier in the series that were not selected as peaks/troughs (look at the area between 2 and 3, and between 5 and 6). Additionally, some of the "peaks" are fuzzy areas/zones, like 20, 23 and 28; they are not nearly as pronounced as the other peaks. My point is that, from what I can tell, the peaks/troughs you would have liked to identify in a "live" feed do not appear to be chosen by objective criteria. That makes the task harder: some of your peaks are small, others are big, and it is difficult for a method to identify the small and big peaks you chose without also flagging intermediate ones.

I know you have 3 streams of data, so perhaps your desired peak/trough choice is made more objective by using all 3 in some way? In any case, this is not a criticism; I am just highlighting an inherent inconsistency/difficulty.

The Proposed Solution

We're going to calculate 2 moving averages, one fast and one slow. We will say that when the fast moving average crosses the slow moving average, a peak/trough has recently occurred.

Here are the results from my attempt at it (blue vertical lines indicate detection of a recent peak/trough):

[Plot: the original series with blue vertical lines at the points where a recent peak/trough was detected]

Notes:

  1. Look in the area before the blue line to find the actual peak/trough. The blue lines are not the peaks/troughs, they are the points in time at which a recent peak/trough was determined/identified. There are a total of 44 peaks/troughs identified.

  2. There might be some degree of data snooping/leakage, as I adjusted my model while eyeballing the whole dataset. When you are manually tuning the model (code given below), I would recommend tuning it on one half of the data and then testing it on the other half you haven't been staring at, to make sure it isn't badly overfit.

  3. There is clearly a time lag, which appears to be more or less within your acceptable limits, but it can be reduced at the cost of more false positives (explained in detail below).

My code in R:

series <- read.csv(file="C:\\...\\MA_meas3.csv", header=TRUE)
library(pracma)

# Fast moving average (exponential)
ma1 <- movavg(series$X.0.000834974, 500, 'e')

# Slow moving average (modified)
ma2 <- movavg(series$X.0.000834974, 525, 'm')

# Plot the original series along with both moving averages (on the same axes),
# to get visual intuition for the method
plot(series$X.0.000834974, type='l')
lines(ma1, col='red')
lines(ma2, col='blue')

# Find peaks/troughs by applying basic rules to the two moving averages
index <- c(0)
for (i in 2:length(ma2)){
  if (ma1[i] >= ma2[i] && ma1[i-1] < ma2[i-1] && i > (index[length(index)] + 100)){
    # Fast MA just crossed above the slow MA: a trough occurred recently
    index <- c(index, i)
  } else if (ma1[i] <= ma2[i] && ma1[i-1] > ma2[i-1] && i > (index[length(index)] + 100)){
    # Fast MA just crossed below the slow MA: a peak occurred recently
    index <- c(index, i)
  }
}

# Plot peak/trough detection (index[1] is just the initial 0 placeholder)
abline(v=index, col='blue')

Explaining the rules within the for-loop:

The loop simply checks whether the two moving averages have just crossed, and whether that crossing happened at least 100 periods after the previous one. I included the 100-period rule because, without it, there are a few areas where the moving averages cross extremely often. To thin that out, I assumed you probably won't have a peak and a trough within 100 periods of each other, so any crossing that soon after the previous one is treated as a false positive by default.

What you can tune to find a better detection model using the code:

  1. Tune the lookback periods for ma1 and ma2 (I set them at 500 and 525, respectively). The larger the lookback period, the slower the average moves. If you think there are too many false positives and you're willing to accept more lag in exchange for fewer of them, increase the lookback of ma2 (and vice versa: if faster detection with more errors is fine, decrease it). Note that ma1 is an "exponential" moving average, as denoted by the parameter 'e', and ma2 is a "modified" moving average, hence the 'm'. There are other options to play around with if you want; see the documentation for the movavg() function in the pracma package. If you decide you only care about really big peaks (like 13 and 14, and not small ones like 16 and 17), you can simply increase the lookback values for both ma1 and ma2, making the method less sensitive to small oscillations.

  2. Tune the minimum number of periods since the last peak/trough. I set it at 100, but you can increase it (to further decrease the number of false positives, at the risk of missing a signal) or decrease it (to reduce the risk of missing a very fast peak/trough switch, at the cost of more false positives). A small sketch after this list wraps these knobs in a function to make the tuning convenient.
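
To make the tuning convenient, the detection logic above can be wrapped in a small function whose arguments are exactly the knobs in points 1 and 2. This is only a refactoring sketch of the code above; the function name detect_turns and its argument names are mine, with defaults copied from the values used in the answer (500, 525, 'e', 'm', 100).

library(pracma)

# Sketch: the detection loop from the answer wrapped as a function, so the
# lookbacks, moving-average types and minimum gap can be tuned in one place.
# (Function and argument names are illustrative, not from the original code.)
detect_turns <- function(x, n_fast = 500, n_slow = 525,
                         type_fast = 'e', type_slow = 'm', min_gap = 100) {
  ma_fast <- movavg(x, n_fast, type_fast)
  ma_slow <- movavg(x, n_slow, type_slow)
  index <- c(0)
  for (i in 2:length(ma_slow)) {
    crossed_up   <- ma_fast[i] >= ma_slow[i] && ma_fast[i - 1] < ma_slow[i - 1]
    crossed_down <- ma_fast[i] <= ma_slow[i] && ma_fast[i - 1] > ma_slow[i - 1]
    if ((crossed_up || crossed_down) && i > index[length(index)] + min_gap) {
      index <- c(index, i)
    }
  }
  index[-1]  # drop the initial placeholder 0
}

# Example: a less sensitive variant that should only react to the larger swings
# turns_big <- detect_turns(series$X.0.000834974, n_fast = 1000, n_slow = 1050)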

I hope this offers a computationally simple, flexible and reasonably effective method at identifying peaks/troughs in real time.

Vladimir Belik