Removing leading zeros from time series

Question

I am currently working with a lot of time series data, and many of the series have a lot of leading zeros. For example,

ts = [0, 0, 0, 0, 0, 0, 256, 129, 345, ...]

All of the time series I am investigating are on a monthly cadence. My overall goal is to get the best forecast after fitting with Prophet, ARIMA, etc.

My first thought for the series with many leading zeros was to remove the zeros and then fit a model. However, this does not always yield the best results in terms of RMSE or MAPE. So my questions are:

Is it customary to remove leading zeros from a time series?
Are there any types of analysis or tests that can determine when to remove leading zeros from a time series?

I have done some searching online, but I could not find much information on this topic. Any comments or resources would be greatly appreciated.

score 7 · Accepted Answer · answered Jun 17 '19 at 20:42

No, this is not customary.

Your time series may be an intermittent time series: demand is often zero and sometimes nonzero. In this case, leading zeros would just represent bona fide zero demand. Removing them would bias your forecast upward.

Conversely, your leading zeros might result from padding a series with zeros, and represent an artifact of data munging. In this case, you should of course remove them.

You need to understand where your data comes from and what has been done with it in order to know what to do with it.

If your series starts with a "long" string of zeros and there are "few" zeros after, then you probably have the second of the two cases above, and you could remove the zeros with a certain level of confidence. Appropriate values for "long" and "few" will depend on your time series, especially on its time granularity and what it actually represents. You should also remove the zeros if they are physically impossible or at least highly implausible, e.g., if they represent humidity measures taken someplace wet.
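As a rough illustration of the "long" and "few" rule of thumb, here is a minimal Python sketch. The function names, the thresholds (`min_leading`, `max_later_zero_share`) and the example data are all made up for illustration; suitable values depend on the series and on what you know about it.

```python
from itertools import takewhile

def leading_zero_run(ts):
    """Count the consecutive zeros at the start of the series."""
    return sum(1 for _ in takewhile(lambda x: x == 0, ts))

def strip_leading_zeros(ts):
    """Drop only the initial run of zeros; zeros later in the series are kept."""
    return list(ts[leading_zero_run(ts):])

def looks_like_padding(ts, min_leading=6, max_later_zero_share=0.05):
    """Hypothetical heuristic: a 'long' leading run followed by 'few' zeros.

    Both thresholds are illustrative assumptions, not recommended defaults.
    """
    n_lead = leading_zero_run(ts)
    rest = ts[n_lead:]
    if not rest:  # all zeros: nothing to compare against
        return False
    later_zero_share = sum(1 for x in rest if x == 0) / len(rest)
    return n_lead >= min_leading and later_zero_share <= max_later_zero_share

# Made-up monthly series: one that looks padded, one that looks intermittent.
padded = [0, 0, 0, 0, 0, 0, 256, 129, 345, 310, 298, 402]
intermittent = [0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 0, 2]

print(looks_like_padding(padded), looks_like_padding(intermittent))
print(strip_leading_zeros(padded))
```

A check like this can only flag candidates. Whether the zeros are real still has to come from knowing how the data were produced.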

Bottom line: subject matter expertise is pretty much indispensable here.

Incidentally, you write that "this does not always yield the best results in terms of RMSE or MAPE". Be aware that the RMSE and the MAPE are usually minimized by different point forecasts. See "Why use a certain measure of forecast error (e.g. MAD) as opposed to another (e.g. MSE)?". I have an upcoming invited commentary on the M4 forecasting competition, which should soon appear in the International Journal of Forecasting and which discusses this further. If you are interested, I can send you the manuscript.
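To make the RMSE versus MAPE point concrete, here is a small sketch that searches for the best constant point forecast under each measure. The values in `y` are made up (right-skewed and strictly positive, so that the MAPE is defined), and the grid search is only a simple way to show the effect.

```python
import math

# Made-up, right-skewed positive "demand" values (not from the question).
y = [10, 12, 15, 20, 30, 60, 120]

def rmse(forecast, actuals):
    """Root mean squared error of a single constant forecast."""
    return math.sqrt(sum((a - forecast) ** 2 for a in actuals) / len(actuals))

def mape(forecast, actuals):
    """Mean absolute percentage error of a single constant forecast."""
    return 100 * sum(abs(a - forecast) / a for a in actuals) / len(actuals)

# Candidate constant forecasts from 5.0 to 130.0 in steps of 0.1.
grid = [k / 10 for k in range(50, 1301)]

best_rmse = min(grid, key=lambda f: rmse(f, y))
best_mape = min(grid, key=lambda f: mape(f, y))

print("mean of y:", sum(y) / len(y))
print("RMSE-optimal constant:", best_rmse)
print("MAPE-optimal constant:", best_mape)
```

On data like this, the RMSE-optimal constant lands next to the mean, while the MAPE-optimal constant is much lower. A model that looks "best" on one measure can therefore look poor on the other without being a worse model.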

Thank you @Stephan Kolassa for the response. That makes sense, and with certainty, I can say that the series that have zero values at the beginning of the series are because the data was not observed for that period. That manuscript sounds also like a great read for me because I am still trying to understand what measures are appropriate and which others are not when it comes to fitting time series. — RDizzl3, Jun 17 '19 at 20:48
If you know that zeros don't make sense, that is great - then you know you can remove them. If you want that manuscript, send me an email at Stephan dot Kolassa at sap dot com (and ping me here in a couple of days if I don't answer, in case your mail ends up in my spam folder). — Stephan Kolassa, Jun 17 '19 at 20:50

score 0 · Answer 2 · answered Jun 18 '19 at 08:48

My approach (given that it is not intermittent demand data) is to reverse forecast. Take your time series, arrange it from last to first (dropping the zeros, which now trail the reversed series), and model it to obtain predicted values for the unrecorded past.

See https://stats.stackexchange.com/search?q=user%3A3382+reverse+forecasting . Additionally, some researchers use the term backcasting for filling in or estimating values that went unrecorded. Also see "hindcast" and the question "What is the proper name for a backward forecast?"
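The mechanics of reversing, fitting and extrapolating can be sketched as follows. This is only an illustration: a least-squares linear trend stands in for whatever model one would actually fit (ARIMA etc.), and the helper names `fit_linear_trend` and `backcast`, as well as the data, are made up.

```python
def fit_linear_trend(values):
    """Ordinary least squares fit of values[t] = intercept + slope * t."""
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return v_mean - slope * t_mean, slope

def backcast(ts, n_leading):
    """Replace the first n_leading entries with values backcast from the rest.

    Needs at least two observations after the leading entries.
    """
    observed = ts[n_leading:]
    reversed_obs = observed[::-1]  # last observation comes first
    intercept, slope = fit_linear_trend(reversed_obs)
    m = len(reversed_obs)
    # Forecasting the reversed series forward extends the original backward.
    extension = [intercept + slope * (m + h) for h in range(n_leading)]
    return extension[::-1] + observed

# Made-up series with three unrecorded (zero) leading months.
ts = [0, 0, 0, 130, 140, 150, 160, 170]
print(backcast(ts, 3))
```

The filled-in values are computed entirely from the observed part of the series, so they should be treated as estimates rather than as additional observations.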

IrishStat

Aren't you implicitly re-interpreting the zeros as missing data? – whuber Jun 18 '19 at 13:55
Yes, because they would be inconsistent with the prediction and thus flagged as exceptional. The size (complement) of that estimated intervention/pulse is then the estimate of the missing value. – IrishStat Jun 18 '19 at 14:03
Thank you -- that makes sense. I presume by "flagged as exceptional" you mean "identified *a priori* as a sequence of initial values that might not be characteristic of the remaining time series and whose presence could adversely affect the results of the analysis." But if that's the case, why would backcasting be of any use at all when the stated purpose is "to get the best forecast"? Isn't your proposal tantamount to saying "just drop the initial zeros"? – whuber Jun 18 '19 at 14:13
In the example where the replacement values are estimated (from the artificial first step), one would have to actually redefine the original data (going forward), with the values estimated from the backcast replacing the 0's. Thus, in the second step the data would be cleaner, as the 0's would be replaced with estimates from the artificial backcast and would now generally be nonzero. – IrishStat Jun 18 '19 at 14:22
Right--but why? Since the zeros are replaced by estimates from the ensuing series, they add no new information. Thus, if there is any justification for this extra work of imputing the leading values, it would be in a demonstration that it improves the forecasting procedure. Such an improvement seems unlikely. Indeed, this process could be misleading, because if a standard forecasting procedure is applied, it will "think" it has more data than it actually does and therefore will produce prediction intervals that are too narrow. – whuber Jun 18 '19 at 14:27
The improvement would be in the forecast, as the values used to launch the memory-based predictions would be non-zero. Why don't you post a question and include sample data for a univariate problem where the most recent values are 0.0 since they were unrecorded, and I will show all the steps, since my words don't seem to be enough for you and probably others. The prediction intervals using Monte Carlo may be marginally affected, if at all; besides, the intervals would otherwise be in error because the expected value would be in error due to being launched with 0's. – IrishStat Jun 18 '19 at 14:33
When someone makes an extraordinary claim, the onus is on them to demonstrate it. In this case your claim comes down to this: if we take *any* time series and prefix it with a number of zeros, your procedure of backcasting those zeros and using them for forecasting will improve on a straightforward forecast. Because that flies in the face of mathematics and intuition, it's extraordinary--but it's possible to conceive of certain forecasting *procedures* that might, by some accident, be thereby improved. It is up to you to exhibit such procedures and demonstrate this property. – whuber Jun 18 '19 at 14:37
OK, I will do so and add it to my response. – IrishStat Jun 18 '19 at 14:41
