I would suggest using your knowledge to effect an initial weeding of the low values and then use a robust nonlinear regression method to estimate the curve. Tests with synthetic data indicate this can work extremely well.
Step 1 is to form a rolling maximum of the monthly data. The initial weeding then removes every value less than some small fraction of the corresponding maximum--one-tenth, say--as sketched below.
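For instance, here is a minimal sketch of that weeding applied to a vector y of monthly values (the window half-width w and the one-tenth threshold are illustrative choices, not requirements):

library(zoo)
w <- 5
# Pad both ends, then take a centered rolling maximum of width 2*w + 1.
z <- rollapply(c(rep(y[1], w), y, rep(y[length(y)], w)), 2*w + 1, max)
keep <- y >= 0.1 * z # TRUE for the values to retain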
Step 2 alternates between estimating the scale parameter $1/(bD)$ using nonlinear least squares and estimating the other parameters using a robust (but ordinary) linear model of $\log(q)$ in terms of $\log(1 / (1 + bDt)).$ The robust method finds the extreme residuals, downweights them in a principled way, and repeats until the results stabilize. This ought to give good starting estimates of the other two parameters--the amplitude $q_i$ and the shape $1/b$--for the next iteration.
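To see why a linear model applies, take logarithms of the hyperbolic decline model $q(t) = q_i\,(1 + bDt)^{-1/b}$:

$$\log q(t) = \log q_i + \frac{1}{b}\log\left(\frac{1}{1 + bDt}\right).$$

With $bD$ held fixed, this is linear in $\log(1/(1 + bDt)),$ with intercept $\log q_i$ and slope $1/b.$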
Step 2 needs an initial estimate. The earliest observation (at month 0) can serve to estimate $q_i.$ For the scale parameter, use (say) half the range of months. For the shape parameter, use any value typical for your data--maybe $b=1$ would be a good start.
I found the software had a much easier time when the model was parameterized in terms of the logarithm of the amplitude and the square root of the scale: this keeps both quantities positive throughout the solution, avoiding the risk of stepping to invalid values.
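Concretely, the code below fits the model in the form

$$q(t) = \exp(\log q_i)\left(1 + \frac{t}{s^2}\right)^{-a},$$

estimating $\log q_i,$ the shape $a = 1/b,$ and $s,$ where $s^2$ plays the role of the scale $1/(bD)$: no step the optimizer takes can drive the amplitude or the scale negative.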
Here are six datasets based on $q_i=3000,$ $1/(b_iD_i) = 10,$ and $b_i=1,$ constructed to approximate the nastiest example in the question. The brown dots are the data, sized to show the weights ultimately used in the analysis (set to $0$ for the data screened out at the beginning). The gray curve connects the data. The blue curves are the fits found using this algorithm. Despite the huge amount of variation present in these data, the fits are consistent and correct, and they take little time: only 4 to 6 iterations of Step 2 were needed to bring all parameter estimates into agreement to at least six significant figures.
Because this procedure is somewhat ad hoc, obtaining uncertainty estimates for the parameters is a little challenging. An honest (but computationally intensive) method would bootstrap the procedure by resampling from the residuals; a sketch of this appears after the code below. More simply, the variance-covariance matrix returned by the nonlinear fitting procedure should give you a decent sense of the uncertainties.
The R code I used is shown below so you can check the details.
f <- function(x, theta) {
  q <- theta[1] # Amplitude
  a <- theta[2] # Shape (1/b)
  s <- theta[3] # Scale (1/(bD))
  q / (1 + x / s) ^ a
}
#
# Establish a true data model.
#
q <- 3e3
s <- 10
a <- 1
set.seed(17)
l.X <- lapply(1:6, function(iter) {
  #
  # Generate data.
  #
  x <- 0:110
  y <- f(x, c(q,a,s)) + rt(length(x), 2) * 50
  i <- sample.int(floor(length(x) * 3/4), 20) + 1
  y[i] <- y[i] * rgamma(length(i), 6, 100) # Knock 20 values down toward zero
  y <- pmax(1, y)
  #
  # Eliminate low excursions.
  #
  library(zoo)
  w <- 5
  # Pad both ends, then take a centered rolling maximum of width 2*w + 1.
  z <- rollapply(ts(c(y[1:w], y, y[length(y)+1 - (1:w)])), 2*w+1, max)
  y.0 <- ifelse(y < 0.1 * z, NA, y)
  #
  # Conduct a robust fit by alternating between estimation of the time scale
  # and robust fitting of the amplitude and shape parameters.
  #
  j <- !is.na(y.0)
  X <- data.frame(t=x[j], y=y.0[j])
  theta <- c(log(max(X$y)), 1, diff(range(X$t))/2) # Starting log amplitude, shape, scale
  weights <- rep(1, nrow(X))
  library(MASS) # for rlm()
  for (i in 1:10) {
    #
    # Find the scale.
    #
    fit.nls <- nls(y ~ exp(log.q + log(1 / (1 + t / s^2) ^ a)), data=X, weights=weights,
                   start=list(log.q=theta[1], a=theta[2], s=sqrt(theta[3])))
    s <- coefficients(fit.nls)["s"]
    #
    # Find the other parameters.
    #
    fit <- rlm(log(y) ~ I(log(1/(1 + t/s^2))), data=X)
    beta <- coefficients(fit)
    theta.0 <- c(beta, s^2) # Store the scale itself, not its square root
    weights <- fit$w
    #
    # Check for agreement between the two models.
    #
    if (sum((theta.0/theta-1)^2) <= 1e-12) break
    theta <- theta.0
  }
  cat(iter, ": ", i, " iterations needed.\n")
  cat("Estimates: ", signif(c(q = exp(theta[1]), a=theta[2], s=theta[3]), 3), "\n")
  X.0 <- data.frame(t = x, y = y)
  X.0$y.hat <- exp(predict(fit, newdata=X.0)) # Fitted values on the original scale
  X.0[j, "weight"] <- weights
  X.0[!j, "weight"] <- 0
  X.0$I <- iter
  X.0
})
#
# Plot the results.
#
library(ggplot2)
X <- do.call(rbind, l.X)
ggplot(X) +
  geom_line(aes(t, y), color="#404040") +
  geom_point(aes(t, y, size=weight), shape=21, fill="#b09000") +
  geom_line(aes(t, y.hat), color="#2020c0", size=1.25) +
  scale_size_continuous(range=c(0.25, 1.5)) +
  # coord_trans(y="log10") +
  scale_y_log10(limits=c(3e0, 3e3)) +
  facet_wrap(~ I) +
  xlab("Month") + ylab("Mean Daily Gas (Mcf)")