How can I recognize dramatic changes in a set of observations?

Question

I'm trying to build a monitoring system that will automatically raise a warning when a dramatic change happens to some of the observed parameters.

My problem looks like this: we send e-mails to a large number of recipients. For each mailing, we have a few parameters: the number of e-mails sent, engagement counters such as opens and clicks, and the number of bounces and unsubscriptions.

Typically, the number of emails sent per mailing would change slightly over time. The engagement ratios might stay more or less constant (accounting for variance, of course), or increase or decrease slowly over time.

Whenever there is a dramatic change in one of those metrics (such as bounce rates going up from 1% to 3%, while having been more or less constant before, or open rates decreasing from 30% to 20% while they were increasing slowly before), I want to be able to recognize this trend change.

I already employ static thresholds, but I want to identify outliers that might suggest a dramatic trend change. Which statistical methods are suited for solving this kind of problem?

Are you familiar with [control charts](http://en.wikipedia.org/wiki/Control_chart)? — AdamO, May 16 '14 at 15:31
You might want to look at http://stats.stackexchange.com/questions/99074/how-do-i-detect-the-number-of-distributions-in-a-set-of-data/99082#99082, as I believe that discussion might be interesting for you. — IrishStat, May 18 '14 at 13:16

Sergio · Answer 1 · 2014-05-16T17:59:02.517

1

Well, I think that you should use control charts, as suggested by AdamO. If you are not familiar with control charts, you could try a naive but simple approach: test whether a new value is an influential one, i.e. whether it "changes the trend". Cook's distance may help you.

An example in R code:

> set.seed(1234)
> x <- 1:100                  # 100 observations to estimate the trend
> y <- rnorm(100, 10, 1)      # more or less constant values: mean = 10, sd = 1
> range(y)                    # min(y) = 7.65, max(y) = 12.55
[1]  7.654302 12.548991
> ### First scenario
> y_new <- 15                 # 15 is the new value, larger than max(y)
> y <- c(y[2:100], y_new)     # discard the first value, append the new one
> cd <- cooks.distance(lm(y ~ x))
> # Is the new value an influential one?
> cd[100] > 0.50              # a standard threshold
  100 
FALSE 
> ### Second scenario
> range(y)                    # y includes the previous y_new
[1]  7.654302 15.000000
> y_new <- 18                 # 18 is the new, and influential, value
> y <- c(y[2:100], y_new)  
> cd <- cooks.distance(lm(y ~ x))
> cd[100] > 0.50
 100 
TRUE

I've used 0.50 here, a standard threshold. If your record of past events is long enough, you can check whether it is too low or too high for your needs.
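One way to run that check, as a rough sketch only: slide a window over the past record, compute Cook's distance of the newest point in each window, and see how often a 0.50 threshold would have fired. The `history` vector below is simulated and purely hypothetical, as is the window length of 100; substitute your own series.

```r
set.seed(1234)
history <- rnorm(300, 10, 1)   # hypothetical past record (assumption, not real data)
window  <- 100                 # same window length as in the example above
x <- seq_len(window)

# Cook's distance of the most recent point in every rolling window
last_cd <- sapply(window:length(history), function(end) {
  y <- history[(end - window + 1):end]
  unname(cooks.distance(lm(y ~ x))[window])
})

mean(last_cd > 0.50)                    # share of past windows that would have raised a warning
quantile(last_cd, c(0.50, 0.95, 0.99))  # where a data-driven threshold could sit
```

If the share of warnings on a period you consider uneventful is too high, raise the threshold; if known incidents in your record are missed, lower it.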

HTH, even if it really is a naive approach.
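For comparison, the control-chart route could look roughly like the following p-chart for a bounce rate. This is a minimal sketch under strong assumptions: the history is simulated, every mailing has 10,000 recipients, the underlying rate is constant, and the function name `p_chart_flag` is made up for illustration. Real engagement data are often more variable than a binomial model implies, so the 3-sigma limits may be too tight in practice.

```r
set.seed(42)
# Hypothetical history: 30 mailings of 10,000 e-mails each, bounce rate around 1%
n_sent  <- rep(10000, 30)
bounces <- rbinom(30, size = n_sent, prob = 0.01)

# Centre line of the p-chart, estimated from the history only
p_bar <- sum(bounces) / sum(n_sent)

# Flag a new mailing whose bounce rate falls outside the 3-sigma control limits
p_chart_flag <- function(new_bounces, new_sent, p_bar) {
  sigma <- sqrt(p_bar * (1 - p_bar) / new_sent)  # binomial standard error of a proportion
  p_new <- new_bounces / new_sent
  abs(p_new - p_bar) > 3 * sigma
}

p_chart_flag(300, 10000, p_bar)  # a mailing with a 3% bounce rate
p_chart_flag(100, 10000, p_bar)  # a mailing with a 1% bounce rate
```

Because the limits scale with the size of each mailing, this handles varying send volumes, but it does not handle a slowly drifting rate; for that the centre line would have to be re-estimated on a moving window.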

edited May 16 '14 at 17:59

answered May 16 '14 at 17:38

Sergio


I would agree that this is a naive (but cheap !) approach. – IrishStat May 16 '14 at 20:50
;-) Justifiable (perhaps!) just because other approaches are not toys, and because he is not seeking "structural changes" in a series, but a _single_ recent "jump". – Sergio May 16 '14 at 20:55
To me, a single recent jump is a specific kind of structural change or, more generally, one example of a change point. Please see http://stats.stackexchange.com/questions/97946/changepoints-in-r/98960#98960 – IrishStat May 16 '14 at 21:44
@IrishStat, I _strongly_ agree with you. I just think that there are not mechanical methods to look for structural changes: you must know what you are doing. – Sergio May 16 '14 at 21:55
I believe that there are "some" mechanical methods. I have implemented these in a piece of commercial software that I have helped develop called AUTOBOX (http://www.autobox.com/cms/). I have attempted to incorporate the expertise that I have accumulated in my 50 years of time series practice. If I can help you personally, feel free to contact me. – IrishStat May 17 '14 at 13:18

score 1 · Answer 2 · edited Apr 13 '17 at 12:44

The procedure called INTERVENTION DETECTION speaks to the issue of empirically detecting a change in mean, a change in trend, a change in seasonal indicators and, of course, a one-time change (pulse). Care has to be taken to account for any auto-regressive structure that may be present; otherwise it can easily mask the differences. I recently posted on a problem/question similar to this. You should also note that to detect differences in means one MUST first account for any anomalies (one-time pulses) that may be present; otherwise you might be unable to "see" the differences in the means.

The question "Changepoints in R" speaks to change point detection, which is what I believe you are after. I would suggest that you make your data available to the list, so that I and others might better help you. In my opinion, dated procedures like Cook's distance will be a far cry from what you need, due to possible (probable) auto-regressive structure in the data and possible level shifts, which can obfuscate the detection.
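To make the idea concrete, here is a much-simplified sketch of searching for a single level shift while allowing for AR(1) errors, using only base R's `arima`. It is not the intervention detection procedure itself and not what AUTOBOX does: the data are simulated, the helper `ar1_loglik` is hypothetical, only one shift is considered, and pulses, trend changes and seasonal effects are ignored.

```r
set.seed(1)
n <- 120
# Hypothetical open rates (in %): AR(1) noise around 30, dropping by 10 points from t = 91 on
y <- 30 + as.numeric(arima.sim(list(ar = 0.5), n = n))
y[91:n] <- y[91:n] - 10

# Log-likelihood of an AR(1) model, optionally with a step regressor that switches on at time k
ar1_loglik <- function(y, k = NULL) {
  xreg <- if (is.null(k)) NULL else cbind(step = as.numeric(seq_along(y) >= k))
  fit <- tryCatch(arima(y, order = c(1, 0, 0), xreg = xreg, method = "ML"),
                  error = function(e) NULL)
  if (is.null(fit)) NA_real_ else as.numeric(logLik(fit))
}

# Try every candidate break point away from the edges and keep the best-fitting one
candidates <- 10:(n - 10)
ll     <- sapply(candidates, function(k) ar1_loglik(y, k))
best_k <- candidates[which.max(ll)]

# Gain in fit of the best level-shift model over a plain AR(1) model
lr <- 2 * (max(ll, na.rm = TRUE) - ar1_loglik(y))
best_k
lr
```

Because `lr` is maximised over many candidate break points, it should not be compared against an ordinary chi-square cut-off; a threshold would have to be calibrated, for example by simulation or on a stretch of history known to be stable.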
