Forecasting short time-series at scale

Question

I am looking for a method of forecasting short time series. I need to make many such forecasts in parallel, so I need a simple method that scales to large data. My data is an $n\times (k+1)$ matrix (where $n$ is in the thousands or more and $k$ is somewhere between 3 and 10):

$$ \begin{array}{ccccc} x_{1,t-k} & x_{1,t-k+1} & \dots & x_{1,t-1} & x_{1,t} \\ x_{2,t-k} & x_{2,t-k+1} & \dots & x_{2,t-1} & x_{2,t} \\ \dots \\ x_{n,t-k} & x_{n,t-k+1} & \dots & x_{n,t-1} & x_{n,t} \\ \end{array} $$

The series may, or may not, be correlated with each other.

I need to make forecasts for the $x_{i, t+1}, x_{i, t+2}, x_{i, t+3}$ points (the forecast horizon would hopefully be short) for each $i$-th series. In most cases I can assume some simple linear trend (I am not hoping for much more, given the limited data for each series), so I could simply try something like a linear regression to approximate an AR($k$) model. However, from time to time a steady linear upward trend in my data is followed by a random drop. I would like my method to be sensitive to such a change (e.g. when it occurs between $x_{i,t-1}$ and $x_{i,t}$) and not to keep suggesting a linear upward trend after noticing it. Below I show a made-up example that looks similar to my data; it was created using the a10 dataset from the fpp package by Rob Hyndman.

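library("fpp")  # provides the a10 dataset

# Start and end indices of ten consecutive 10-point windows
s1 <- seq(1, 100, by = 10)
s2 <- seq(10, 100, by = 10)

# Stack the windows row-wise: one short series per row
X <- NULL

for (i in seq_along(s1))
  X <- rbind(X, a10[s1[i]:s2[i]])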

All this means that the common simple methods do not seem like a good choice for me: linear regression; a random walk forecast that takes $x_{i,t}$ as the forecast for the $t+1, t+2, t+3$ values; random walk with drift; predicting the mean of the previous timepoints; etc. On the other hand, limited data and computational constraints make more advanced methods infeasible.

I was thinking of something like using simple exponential smoothing of the changes between the timepoints $\Delta_{i,t} = x_{i,t} - x_{i,t-1}$, i.e. taking

$$ \begin{align} \ell_{i,t} &= (1-\alpha) \ell_{i,t-1} + \alpha \Delta_{i,t} \\ \hat x_{i,t+1} &= x_{i,t} + \ell_{i,t} \\ \hat x_{i,t+2} &= \hat x_{i,t+1} + \ell_{i,t} \\ \dots \end{align} $$

with some pretty high $\alpha$ (this could be optimized), but I do not want to reinvent the wheel, so I'm looking for hints either on this approach or on something better than this. I'd need prediction intervals for my forecasts, so I'd also be grateful for a comment on this.
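To make the idea concrete, here is a minimal base-R sketch of the recursion above, vectorised over all series at once. The function name `ses_diff_forecast`, the default `alpha = 0.8`, and the choice to initialise $\ell$ with the first observed change are my own assumptions for illustration, not something taken from a package.

```r
# Simple exponential smoothing of first differences, vectorised over series.
# X: numeric matrix, one series per row, time in columns (oldest first).
# Hypothetical helper; alpha and the initialisation are assumptions.
ses_diff_forecast <- function(X, alpha = 0.8, h = 3) {
  stopifnot(is.matrix(X), ncol(X) >= 2, alpha > 0, alpha <= 1)
  # Delta_{i,t} = x_{i,t} - x_{i,t-1}
  D <- X[, -1, drop = FALSE] - X[, -ncol(X), drop = FALSE]
  l <- D[, 1]  # assumption: start the smoothed change at the first change
  for (j in seq_len(ncol(D))[-1]) {
    l <- (1 - alpha) * l + alpha * D[, j]
  }
  # x_hat_{i,t+m} = x_{i,t} + m * l_{i,t}, for m = 1, ..., h
  X[, ncol(X)] + outer(l, seq_len(h))
}

X_demo <- rbind(c(1, 2, 3, 4, 5),   # steady upward trend
                c(2, 4, 6, 8, 4))   # upward trend, then a drop
ses_diff_forecast(X_demo)
```

With a high `alpha` the last change dominates, so the second series is forecast to keep falling rather than to resume the upward trend.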

Another problem is that this approach ignores the fact that I could also use the other series to build a general model. I believe that using the whole dataset could lead to improvements (some kind of shrinkage). Maybe I could use something like hierarchical forecasting described by Rob Hyndman (see e.g. those slides or the hts vignette) where the parameters to aggregate the individual series could be estimated using regression (this easily scales).

Can you elaborate on why simple methods do not seem a good choice? — Stephan Kolassa, Nov 06 '17 at 10:32
@StephanKolassa rw or mean are flat and in most cases I'll be seeing the upward trend (as in the data example); regression or rw + drift do not adapt to random drops. — Tim, Nov 06 '17 at 10:33

Stephan Kolassa · Accepted Answer · 2017-11-06T11:04:55.373

A good method for forecasting short time series is double exponential smoothing, which I would prefer to applying single exponential smoothing to the differenced series. This will also allow for changes in trends, in contrast to a simple linear regression. Here is the relevant section in FPP, 2nd edition. The state space framework that, e.g., forecast::ets() uses will give you prediction intervals.
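As a rough illustration of what double exponential smoothing does, here is a hand-rolled base-R sketch of Holt's linear method, vectorised over the rows of a matrix. It is not what forecast::ets() does internally: the function name `holt_forecast`, the fixed smoothing parameters, the start values taken from the first two observations, and the crude residual-based $\sigma$ are all assumptions on my part. The intervals use the additive-error state space form, where the $h$-step forecast variance is $\sigma^2\bigl(1+\sum_{j=1}^{h-1}(\alpha+j\beta)^2\bigr)$ with $\beta=\alpha\beta^*$.

```r
# Holt's linear (double exponential smoothing) forecasts for many short series.
# X: numeric matrix, one series per row, time in columns (oldest first).
# Hypothetical helper: alpha/beta_star are fixed here, not optimised per series.
holt_forecast <- function(X, alpha = 0.8, beta_star = 0.2, h = 3, level = 0.95) {
  stopifnot(is.matrix(X), ncol(X) >= 3)
  lev <- X[, 2]            # assumption: initial level = second observation
  trd <- X[, 2] - X[, 1]   # assumption: initial trend = first difference
  sse <- numeric(nrow(X))
  for (j in 3:ncol(X)) {
    err <- X[, j] - (lev + trd)               # one-step-ahead error
    sse <- sse + err^2
    new_lev <- lev + trd + alpha * err        # level update (error-correction form)
    trd <- trd + alpha * beta_star * err      # trend update
    lev <- new_lev
  }
  # Crude per-series sigma from very few errors; treat it as indicative only
  sigma <- sqrt(sse / (ncol(X) - 2))
  point <- lev + outer(trd, seq_len(h))
  # Forecast sd multiplier: sqrt(1 + sum_{j < m} (alpha + j * beta)^2)
  beta <- alpha * beta_star
  mult <- sqrt(1 + cumsum(c(0, (alpha + beta * seq_len(h - 1))^2)))
  half <- qnorm(0.5 + level / 2) * outer(sigma, mult)
  list(mean = point, lower = point - half, upper = point + half)
}

X_demo <- rbind(c(1, 2, 3, 4, 5, 6),
                c(1, 2, 4, 5, 7, 6))
holt_forecast(X_demo)
```

In practice you would let forecast::ets() estimate the parameters and the intervals. The sketch only shows that the recursion is cheap enough to run over thousands of rows at once.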

If you can meaningfully group your time series (or at least some of them) so they exhibit common patterns, you might be able to improve forecasts by forecasting aggregates (on which the signal is hopefully more easily visible) and adapting lower-level forecasts. Here is the relevant chapter in FPP2. I have repeatedly found this "optimal combination" approach to improve forecasts on all levels of the hierarchy.
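To show what "adapting lower-level forecasts" amounts to in the simplest case, here is a base-R sketch of the OLS variant of optimal combination: incoherent base forecasts $\hat y$ for all nodes are projected onto the coherent subspace via $\tilde y = S(S'S)^{-1}S'\hat y$, where $S$ is the summing matrix. The function name `reconcile_ols` and the toy two-series hierarchy are assumptions for illustration; the hts package implements this and more refined weightings.

```r
# OLS reconciliation of base forecasts in a hierarchy.
# S: summing matrix (rows = all nodes, columns = bottom-level series).
# y_hat: base forecasts for all nodes, in the same row order as S.
# Hypothetical helper for illustration only.
reconcile_ols <- function(S, y_hat) {
  stopifnot(is.matrix(S), nrow(S) == NROW(y_hat))
  # Bottom-level estimates: (S'S)^{-1} S' y_hat
  bottom <- solve(crossprod(S), crossprod(S, y_hat))
  # Map back to all nodes so the result adds up across the hierarchy
  drop(S %*% bottom)
}

# Toy hierarchy: total = A + B, with base forecasts that do not add up
S <- rbind(total = c(1, 1), A = c(1, 0), B = c(0, 1))
y_hat <- c(total = 10, A = 3, B = 5)
reconcile_ols(S, y_hat)
```

After reconciliation the total equals the sum of A and B, and each forecast has been nudged toward the others.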

Making optimal combinations play nice with prediction intervals (or even better, predictive densities), is an ongoing research topic. There was a presentation on "Coherent probabilistic forecasts for hierarchical time series" by Souhaib Ben Taieb, James Taylor and Rob Hyndman at this year's International Symposium on Forecasting. I'd recommend you ping the authors and ask for the presentation.

Of course, if you don't know a priori which time series will be positively correlated so you can hierarchically treat them, this is a problem - especially for short time series. You could simply cluster them, using one minus some correlation as a distance measure. But of course, if your series are short, then you might be clustering a lot of noise.
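A minimal base-R sketch of such a clustering, assuming Pearson correlation, average linkage and a user-chosen number of clusters `k` (the function name `cluster_series` and all three choices are mine, not a recommendation). Note that a constant series has zero variance, so its correlations are undefined and it would need to be handled separately.

```r
# Cluster short series using 1 - correlation as the distance.
# X: numeric matrix, one series per row, time in columns.
# Hypothetical helper; linkage method and k are assumptions.
cluster_series <- function(X, k = 2) {
  stopifnot(is.matrix(X), nrow(X) >= 2, k <= nrow(X))
  R <- cor(t(X))  # series-by-series correlation matrix
  tree <- hclust(as.dist(1 - R), method = "average")
  cutree(tree, k = k)  # cluster label for each series
}

X_demo <- rbind(c(1, 2, 3, 4, 5, 6),
                c(3, 5, 7, 9, 11, 14),
                c(6, 5, 4, 3, 2, 1),
                c(12, 10, 8, 6, 4, 1))
cluster_series(X_demo, k = 2)
```

For thousands of series the full correlation matrix gets large, so you may want to cluster a sample or pre-group the series first.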

As to change point detection, this is of course also a hard topic for short series. The strucchange package for R contains a number of useful functions. You might want to apply these to differenced series, or to the residuals in-sample, so your breakpoint detector is not thrown off by the trend.
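To illustrate why working on the differenced series helps, here is a deliberately crude base-R stand-in. It is not strucchange and not a formal structural-change test: it only flags series whose last change is unusually negative compared with the earlier changes of the same series, using a median/MAD score. The function name `flag_last_drop` and the `threshold = 3` cutoff are assumptions.

```r
# Flag series whose most recent change is an unusually large drop.
# X: numeric matrix, one series per row, time in columns (oldest first).
# Hypothetical heuristic, not a substitute for a proper change point test.
flag_last_drop <- function(X, threshold = 3) {
  stopifnot(is.matrix(X), ncol(X) >= 4)
  D <- X[, -1, drop = FALSE] - X[, -ncol(X), drop = FALSE]  # differencing removes the trend
  past <- D[, -ncol(D), drop = FALSE]  # changes before the last one
  last <- D[, ncol(D)]                 # most recent change
  centre <- apply(past, 1, median)
  spread <- apply(past, 1, mad)
  # Guard against zero spread (perfectly regular past changes)
  spread <- pmax(spread, .Machine$double.eps)
  (last - centre) / spread < -threshold
}

X_demo <- rbind(c(1, 2, 3, 4, 5, 6),             # no drop
                c(1, 2, 3, 4, 5, 2),             # drop at the end
                c(1, 2.2, 2.9, 4.1, 5, 5.9))     # noisy, no drop
flag_last_drop(X_demo)
```

With only a handful of past changes per series the spread estimate is very noisy, which is exactly the difficulty with short series mentioned above.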

If you can cluster your series a priori (or an automatic clustering works well), then you might be able to detect the change points on the aggregates instead of the individual series, where once again, the signal will be stronger.

The problem of course then is what to do with detected change points. A state space framework should of course be able to include an external step change predictor, but unfortunately I believe that neither forecast::ets() nor the functions in the smooth package allow for this (but it might be good if you checked, I'm not entirely sure). Alternatively, you could use auto.arima() and feed a change point indicator into the xreg parameter. Here is Rob Hyndman explaining the difference between regression with ARIMA errors and ARIMAX, recommended reading.
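As a sketch of what a step change predictor does, here is the simplest special case in base R: a linear trend plus a step dummy fitted with lm(), i.e. a regression with white-noise errors rather than ARIMA errors. With auto.arima() you would pass the same kind of dummy via xreg. The function name `step_trend_forecast` and the assumption that the step persists over the forecast horizon are mine.

```r
# Linear trend plus a step dummy at a known change point, for one series.
# y: numeric vector; change_at: index of the first observation after the change.
# Hypothetical helper; assumes the step persists into the forecast period.
step_trend_forecast <- function(y, change_at, h = 3, level = 0.95) {
  n <- length(y)
  stopifnot(n >= 4, change_at >= 2, change_at <= n)
  d <- data.frame(y = y,
                  time_idx = seq_len(n),
                  step = as.numeric(seq_len(n) >= change_at))
  fit <- lm(y ~ time_idx + step, data = d)
  new <- data.frame(time_idx = n + seq_len(h), step = 1)
  # Matrix with columns fit, lwr, upr
  predict(fit, newdata = new, interval = "prediction", level = level)
}

y_demo <- c(1.1, 1.9, 3.2, 4.0, 4.9, 6.1, 2.8, 4.1, 5.0, 6.2)
step_trend_forecast(y_demo, change_at = 7)
```

The forecasts continue the upward trend from the lowered level instead of extrapolating a single line through the drop. If the change is detected only at the very last observation, the step is estimated from one point and the intervals should be read with caution.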

EDIT - you write:

Maybe I could use something like hierarchical forecasting described by Rob Hyndman (see e.g. those slides or the hts vignette) where the parameters to aggregate the individual series could be estimated using regression

I agree that the hierarchical approach could definitely be helpful. However, if you estimate "optimal" coefficients for aggregating series, don't forget that this will be an additional source of variance, which could make your forecasts worse than just using flat weights of 1 (see Claeskens et al., IJF 2016, "The forecast combination puzzle: A simple theoretical explanation").
