Optimal time window to consider in a time series analysis

Question

I am currently working on a time series analysis problem in which I have to forecast sales units for a particular commodity. Assuming I have 5 years of monthly data, is there a mathematical/statistical way to choose how much of it I should use to build the model?

I am confused about the following:

In general it is said that the more data you take, the better it is. Does this hold for a time series as well?
Patterns change a lot over time, so it may not be wise to consider data points that are too old, say older than 2 years.

As already mentioned, I am looking for a statistical method (if one exists) to find an optimal time window for training.

That's not the answer, but a hint. Try using as short an input vector as possible (1 is a good guess) and measure the performance. Increase its size 10 times, then 100 times. Plot the performance of the models versus the length of the input. This simple story will tell you a lot. – Alexey Burnakov, Apr 24 '18 at 18:20

Skander H. · Answer 1 · 2018-04-23T04:33:09.670

In general it is said that the more data you take, the better it is. Does this hold for a time series as well?

Yes it does.

Patterns change a lot over time, so it may not be wise to consider data points that are too old, say older than 2 years.

Yes, patterns do change over time; however, the more sophisticated forecasting methods are able to take that into account, so you should still feed them as much data as possible. The only situation where you would be feeding "too much data" to your model is when you overwhelm the computational resources available to you. That is not going to happen with 5 years of monthly data (i.e. 60 data points).

As for the long term changing patterns, typically a time series is decomposed into 3 components: A trend (the long term variation), a seasonal component, and residuals. Some methods, like Triple/Seasonal Exponential Smoothing models try to model this directly. There are also Double Exponential Smoothing models which consider only a trend, but don't consider any seasonality.
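As a rough illustration of why such methods can be fed the whole history, here is a minimal sketch of Holt's linear (double exponential smoothing) method in plain Python. The function name, the smoothing weights `alpha` and `beta`, and the toy series are assumptions made for this example, not something taken from the answer. The level and trend are updated at every observation, so older points are down-weighted automatically rather than discarded by hand.

```python
def holt_linear_forecast(series, alpha, beta, horizon):
    """Holt's linear trend method: smooth a level and a trend, then extrapolate."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Simple initialisation (one common choice, assumed here).
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # New level: blend the observation with the previous one-step forecast.
        level = alpha * value + (1 - alpha) * (level + trend)
        # New trend: blend the latest change in level with the old trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# Toy data (hypothetical): a perfectly linear series 1, 3, 5, ..., 19.
linear_sales = [2 * t + 1 for t in range(10)]
holt_forecast = holt_linear_forecast(linear_sales, alpha=0.5, beta=0.5, horizon=3)
print(holt_forecast)
```

On a perfectly linear series the method simply continues the line. On real data, values of `alpha` and `beta` closer to 1 make the model forget old observations faster, which is how a changing pattern is accommodated without shortening the training set.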

Another way of looking at it is in terms of stationarity: A (weakly) stationary time series is one whose statistical properties (mean, variance) remain the same over time. To your point, real life time series data changes over time and is non-stationary. So some methods (namely ARIMA models) will first transform the data into a stationary time series using differencing, try to predict the future values of the stationary series, and then reverse transform the forecasts (to bring back the effect of the long term variations into the model).
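The difference / forecast / reverse-transform cycle described above can be sketched in a few lines of plain Python. The helper names and the toy numbers are hypothetical, and the "model" for the differenced series is deliberately trivial (its mean), standing in for whatever ARIMA would actually fit.

```python
def difference(series):
    """First differences: d_t = x_t - x_{t-1}."""
    return [b - a for a, b in zip(series, series[1:])]


def undifference(start_value, diffs):
    """Invert first differencing by cumulatively summing from a starting value."""
    out = [start_value]
    for d in diffs:
        out.append(out[-1] + d)
    return out


# Toy trending series (hypothetical data).
sales = [100, 104, 109, 113, 118]
diffs = difference(sales)                      # [4, 5, 4, 5]

# Stand-in forecast on the differenced scale: repeat the mean difference.
mean_diff = sum(diffs) / len(diffs)            # 4.5
diff_forecast = [mean_diff] * 3

# Reverse transform: accumulate the forecast differences from the last observation.
level_forecast = undifference(sales[-1], diff_forecast)[1:]
print(level_forecast)                          # [122.5, 127.0, 131.5]
```

The trend that differencing removed comes back in the final step, because each forecast difference is added on top of the last observed level.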

As already mentioned, I am looking for a statistical method (if one exists) to find an optimal time window for training.

You need to be careful here because, based on your question, you seem to be confusing two separate concepts. The optimal time window for training is not the same thing as the amount of data you should use.

The overall amount of data that you should use to train your model is all of it, as I explained above.
The optimal time window for training is not a fixed concept and is going to depend on the method you use for forecasting and the business case you are dealing with. For 5 years of monthly sales data, your data is most likely going to have a seasonality of 12. On top of that, some forecasting methods require that you specify the number of lagged time periods used to build your model. This is the case with ARIMA models and neural-network-based forecasting models, for example.

Again, the number of lagged time periods is not the same thing as the total amount of historical data used. Using $k$ lagged time periods assumes that, at any given point in time, the value of my series $X_t$ is determined by at most the values of $X_{t-1}$, $X_{t-2}$,...,$X_{t-k}$.

Consider a simple auto-regressive model of order 3 ($AR(3)$) of a time series (note that this model is oversimplified):

$X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \alpha_3 X_{t-3}$

So $X_t$ depends only on the last three periods (that is my optimal window so to speak) - but I still need to use my entire history, not just the data from the 3 last months, to find the best estimate of $\alpha_1$, $\alpha_2$ and $\alpha_3$.
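To make the distinction concrete, the sketch below simulates an $AR(3)$ series and estimates $\alpha_1$, $\alpha_2$, $\alpha_3$ by ordinary least squares over the whole history. Everything here is an assumption for illustration: the coefficient values, the added Gaussian noise term (the formula above omits it), and the helper names. Each regression row uses only 3 lags, but every available row contributes to the estimate.

```python
import random


def simulate_ar(coeffs, n, seed):
    """Simulate x_t = sum_j coeffs[j] * x_{t-1-j} + noise (noise is an added assumption)."""
    rng = random.Random(seed)
    p = len(coeffs)
    series = [0.0] * p
    for _ in range(n):
        nxt = sum(c * series[-1 - j] for j, c in enumerate(coeffs)) + rng.gauss(0, 1)
        series.append(nxt)
    return series[p:]


def solve_linear(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ar(series, p):
    """Least-squares AR(p) fit (no intercept) via the normal equations."""
    rows = [[series[t - 1 - j] for j in range(p)] for t in range(p, len(series))]
    targets = series[p:]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve_linear(xtx, xty)


true_coeffs = [0.5, -0.2, 0.1]           # hypothetical alpha_1..alpha_3
history = simulate_ar(true_coeffs, n=3000, seed=0)
estimated = fit_ar(history, p=3)         # uses every row of the history
print(estimated)
```

Calling `fit_ar` on a short slice of `history` still works, but the estimates are generally noisier, which is the practical reason for keeping the full history even when the lag window is only 3.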

To answer your final question: the standard statistical tools for choosing the number of lags are the autocorrelation function (ACF) and the partial autocorrelation function (PACF). They are usually applied only after the time series has been made stationary. Seasonality is usually something that you can figure out from your business case (monthly sales data will have a seasonality of 12, hourly restaurant customer data will have a seasonality of 24, daily customer data at a movie theater will have a seasonality of 7, etc.).
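For readers who want to see what the ACF and PACF actually compute, here is a small self-contained sketch: the sample autocorrelation, and the partial autocorrelation obtained from it with the Durbin-Levinson recursion. The function names and the simulated AR(1) series are assumptions for this example; in practice a statistics library would normally be used instead.

```python
import random


def acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag (r_0 is 1 by construction)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def pacf(series, max_lag):
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = acf(series, max_lag)
    out = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out


def simulate_ar1(phi, n, seed):
    """Simulate x_t = phi * x_{t-1} + Gaussian noise (hypothetical test data)."""
    rng = random.Random(seed)
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0, 1))
    return series


ar1_series = simulate_ar1(phi=0.7, n=3000, seed=1)
sample_acf = acf(ar1_series, max_lag=5)
sample_pacf = pacf(ar1_series, max_lag=5)
# Rough 95% band for "no correlation": +/- 1.96 / sqrt(n).
band = 1.96 / len(ar1_series) ** 0.5
print(sample_acf, sample_pacf, band)
```

For an AR(1) process the ACF decays gradually while the PACF drops to roughly zero after lag 1, which is the pattern one reads off to pick the autoregressive order. Lags whose PACF stays inside the rough band are candidates to leave out.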

What if your time series has holes here and there (e.g. when the sensor fails or is disconnected)? Then "all of it" is not the optimal input, correct? Or is there some way to account for this? — HorseHair, Jun 29 '19 at 17:18

IrishStat · Accepted Answer · 2018-04-24T18:09:38.387


The optimal window that you are looking for has to do with "what subset of the data should I use, if any?". There is no useful rule of thumb like "just use the last k observations to identify an appropriate model"; the data will tell you that.

If you have, say, NOB observations and you are interested in a k-period-out forecast, you might begin by using NOB - k = N values to develop a model with a set of resultant parameters. Now take the N observations (say 1000), segment them into n1 and n2 at a particular time point (e.g. the first 100 versus the most recent 900), and construct a Chow test for constancy of parameters to test whether or not this is a significant breakpoint in the parameters. Do this for each candidate time point (e.g. 150 and 850, ..., 900 and 100, where 500 and 500 would split the data into two halves) to find the point yielding the maximal contrast between n1 and n2. Of course, the minimum number of observations in each segment should be sufficient to identify the ARIMA structure.

This will provide you with the time window (i.e. the n2) that you should use; simply ignore the first n1 observations. This is the procedure that I programmed into AUTOBOX some ten years ago to make sure that too much data is not used. In many, many cases that we have looked at, using all the data is a flawed approach, and selecting the right amount of older data to discard can be critical.
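The breakpoint scan can be sketched as follows. This is not AUTOBOX's implementation: it swaps the ARIMA model for a simple AR(1)-with-intercept regression so that the Chow statistic can be computed with plain Python, and the function names, the `min_segment` value of 36, and the simulated series are all assumptions for illustration.

```python
import random


def ols_rss(pairs):
    """Residual sum of squares of the regression y = a + b * x."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in pairs)


def best_chow_break(series, min_segment):
    """Scan candidate breakpoints and return (time index, Chow F) of the largest F.

    The returned time index is the first observation of the recent segment (n2).
    """
    pairs = [(series[t - 1], series[t]) for t in range(1, len(series))]
    k = 2  # parameters per regime: intercept and slope
    rss_pooled = ols_rss(pairs)
    best_t, best_f = None, float("-inf")
    for b in range(min_segment, len(pairs) - min_segment + 1):
        rss_split = ols_rss(pairs[:b]) + ols_rss(pairs[b:])
        f_stat = ((rss_pooled - rss_split) / k) / (rss_split / (len(pairs) - 2 * k))
        if f_stat > best_f:
            best_t, best_f = b + 1, f_stat  # pairs[b] belongs to time t = b + 1
    return best_t, best_f


# Hypothetical data: the AR(1) coefficient flips from 0.8 to -0.8 at t = 200.
rng = random.Random(42)
regime_series = [0.0]
for t in range(1, 400):
    phi = 0.8 if t < 200 else -0.8
    regime_series.append(phi * regime_series[-1] + rng.gauss(0, 1))

break_t, break_f = best_chow_break(regime_series, min_segment=36)
recent_window = regime_series[break_t:]   # the n2 observations to model
print(break_t, round(break_f, 1), len(recent_window))
```

One caveat worth keeping in mind: because the breakpoint is chosen to maximise the statistic, the usual F critical values for a single pre-specified Chow test no longer apply directly, so a large maximum on its own is weaker evidence than it looks. The `min_segment` floor plays the role of the "enough observations to identify the model" requirement mentioned above.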

@Alex asked for an example. I believe this is the "killer example" that Tong used to launch the subject in 1980. In 1990 I became painfully aware that estimating one set of parameters for the entire data set presumes that the parameters are constant over time. That motivated me to implement a practical approach to testing that hypothesis and, upon rejection, working with a recent subset (not too small!) that is homogeneous in its parameters, or at least does not suggest heterogeneity, since you can never really prove anything to be true.

This issue has been largely ignored by software developers, except for the most diligent.

The graph clearly shows that things are more persistent in the second half, and that the acf(1) is very small for the first half and quite large for the second half. Q.E.D.

edited Apr 24 '18 at 18:09

answered Apr 22 '18 at 19:34


If $n_1 \approx n,$ then you would be recommending throwing out almost all the data, wouldn't you? Since something must be preventing this circumstance, what principle are you applying to prevent it and why does it work? – whuber Apr 22 '18 at 20:35
If n1=n2, that would suggest splitting the data and using the most recent half. Sorry if that wasn't clear. – IrishStat Apr 22 '18 at 21:22
Whereas the original Chow test assumed perfect knowledge of the separation between set 1 and set 2, I am simply searching for the optimal separation point, recursively. Nice to have you overlooking/critiquing, as it only helps us all. – IrishStat Apr 22 '18 at 21:28
My concern is that your approach seems likely to (a) discard useful data and (b) overfit. I was asking what steps you have taken to avoid those problems. It must be lurking in your definition and evaluation of "optimal," but (so far) you haven't disclosed this important detail. – whuber Apr 23 '18 at 14:07
The approach minimizes the risk of overfitting by setting a minimum number of values in each sector. For example, if we had 120 monthly values, the contrasts would be 49 vs 71, 53 vs 67, ..., 73 vs 47; thus there would be enough "local observations" to deal with estimation without overfitting. – IrishStat Apr 23 '18 at 15:21
How exactly is the minimum number determined and what is the theoretical basis for this? How can one demonstrate that potentially removing the majority of the data still leaves "enough" data? – whuber Apr 23 '18 at 15:33
To test for seasonal ARIMA with monthly data, it has been standard practice to suggest that at least 3 seasons should be in effect, i.e. 36. Rounding led to the number 47 as a "safe estimate" to use. The rule of thumb of 3 full cycles can be taken to 4 or 5 if you wish, but we have found that 3 is reasonably sufficient via exhaustive testing. All heuristics at one point or another use "rules of thumb". Definition of rule of thumb: 1: a method of procedure based on experience and common sense. 2: a general principle regarded as roughly correct but not intended to be scientifically accurate. – IrishStat Apr 23 '18 at 16:40
It is possible to demonstrate by actually changing the rule of thumb and evaluating subsequent effects on fitting statistics (my preference) and/or out-of-sample values, which suggests that "the tail wags the dog". The AUTOBOX user can specify this rule of thumb interactively. – IrishStat Apr 23 '18 at 16:43
"In many many cases that we have looked at using all the data is a flawed approach and selecting the right amount of older data to discard can be critical." Can you give an example of a time series where this happens? – Skander H. Apr 24 '18 at 04:40
@Alex Why don't you open up a new question entitled "Please provide me a case study where too much data exists and is counter productive ? " as I can then provide some clarity. – IrishStat Apr 24 '18 at 07:55
@Alex specifically how do I go about attaching a csv file from my machine so that you and others can get it ? – IrishStat Apr 24 '18 at 09:49
@IrishStat we wouldn't need the entire csv, just a graph of the data would do I think. – Skander H. Apr 24 '18 at 16:10
https://stats.stackexchange.com/questions/342512/are-there-ever-situations-where-adding-too-much-data-to-forecasting-model-reduce – Skander H. Apr 24 '18 at 16:24
