
I recently got to work on a problem of forecasting five years of data, but I only had five data points from previous years, i.e. yearly data (frequency = 1). The data is heteroskedastic. For example: [574, 1346, 2051, 6700, 40].

What would be the best model to estimate the next five years? From my research, linear regression is out of the picture because we need homoskedastic data, and an ARIMA model will fail because it doesn't have enough data.

Note: That is all the data we have.

Any suggestions?

Thanks,

Prerit

– Slayer
  • Why is your last data point 40? – forecaster Aug 20 '17 at 21:41
  • Good question. That is the data. :P – Slayer Aug 20 '17 at 21:42
  • @forecaster Say it wasn't 40; what methodology would you suggest then? Also, consider another situation: you have 100 such observations, each containing 5 years of data, but these observations are taken from different samples and hence are independent. – Slayer Aug 20 '17 at 21:46
  • Small data sets require models with few parameters and/or some form of regularization/informative priors. Your exclusion of regression and ARIMA is unfounded: regression places a constant variance assumption on the residuals, not the data (and I'm not sure why you claim your data is "heteroskedastic" anyway), and there is no lower bound on sample size for ARIMA. If you have 100 short time series of a similar character, there is probably something to be gained by sharing information across series (either in the form of parameter constraints across series, or in a multi-level sort of scheme). – Chris Haug Aug 21 '17 at 00:13
  • @ChrisHaug You are right. I was under the impression that homoskedasticity of the residuals was the same as the data being homoskedastic; just my wrong use of the word. I have read up on it since. The residual plot forms a funnel shape, and auto.arima in R converges to the mean for all of the next five years (see the sketch after these comments), which is not very informative. – Slayer Aug 21 '17 at 00:37
  • "auto.arima in R converges to mean". That's about the best guess that anyone could make, given so little data. – david25272 Aug 21 '17 at 00:44
  • @david25272 Totally agreed. Also, I am reading a paper by Rob Hyndman et al. on "minimum sample size requirements for seasonal forecasting models", and it says I need at least m+2 observations; by that rule I could only estimate 3 years. – Slayer Aug 21 '17 at 00:52
  • Yearly data is not seasonal data, so the paper you cite will not help. Do you have access to analogous data? If yes, you could use that; otherwise you have to rely on domain knowledge, market research, and other qualitative techniques. There is no statistical method that can produce a forecast from just 5 data points. – forecaster Aug 21 '17 at 01:24
  • Put it this way: why do you think that the best forecast *isn't* the mean? What do you believe or know, aside from those 5 numbers up there, that makes you think otherwise? That's what you need to exploit. For example, the first 4 points might look like an increasing trend, but you can't really "prove" that with so few points. Do you have independent reason to believe that a trend exists? Do you know what happened in the 5th year? Do you believe that the 5th year is an anomaly that will adjust by the 6th, or are we in a new regime? For the 100 series, do they have similar trending behavior? – Chris Haug Aug 21 '17 at 12:34
  • @ChrisHaug By all means, the mean is the best estimate. I have no supplementary information besides those numbers. There are so few points that I cannot know whether there is a trend. No, each of the 100 series represents a different entity in itself; as far as I know, the growth of one school, say, doesn't depend on any other school. My ultimate and final thought is this: the data is useless, given there are not enough samples to prove anything. Even 5 points cannot estimate the population mean. That is my opinion. What do you say? – Slayer Aug 21 '17 at 13:07
  • Even the mean might be a poor estimate. We need a large enough sample to apply the CLT. – Slayer Aug 21 '17 at 13:08
  • The data is not "useless". Sure, it's not a big sample, but it's sufficient to signal to you that forecasting 2000 for the next period is probably a lot more reasonable than 100 billion, or 0.000000001. Focus on simple models, include priors based on out-of-sample knowledge, and accept that the forecasts will have high uncertainty. What do you need the CLT for, precisely? Not every analysis requires it. – Chris Haug Aug 21 '17 at 13:36
  • Ohh... the CLT comes in because I think that with a small sample we cannot say the sample mean is close to the population mean, but with a large enough sample we can say with some confidence that the mean is the best estimate. – Slayer Aug 21 '17 at 13:46
  • Also, I am worried about the analysis because I feel this amount of data cannot be used for making a business decision. – Slayer Aug 21 '17 at 14:38
  • Maybe bootstrapping will help you (see the second sketch after these comments). – Beytullah Gonulal Feb 10 '19 at 14:20
  • You could start by simulating your variables, taking into account information that is directly linked to your data set. – Moreno Aug 20 '17 at 22:51
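
For concreteness, a minimal R sketch that reproduces the flat auto.arima forecast mentioned in the comments; it assumes the forecast package is installed, and the start year is arbitrary since none was given in the question:

```r
# Fit auto.arima to the five yearly observations and forecast five years ahead.
# With so little data it typically selects ARIMA(0,0,0), i.e. a constant mean,
# so all five point forecasts are flat and the prediction intervals are wide.
library(forecast)

y <- ts(c(574, 1346, 2051, 6700, 40), start = 2012, frequency = 1)  # start year assumed
fit <- auto.arima(y)
summary(fit)
forecast(fit, h = 5)
```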

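Along the lines of the bootstrapping comment above, a rough sketch of a nonparametric bootstrap of the sample mean; with only five values this is a sanity check on the uncertainty, not a proper prediction interval:

```r
# Resample the five observations with replacement many times, take the mean of
# each resample, and inspect the spread of those means.
set.seed(123)

y <- c(574, 1346, 2051, 6700, 40)
boot_means <- replicate(10000, mean(sample(y, replace = TRUE)))

mean(boot_means)                        # bootstrap estimate of the mean
quantile(boot_means, c(0.025, 0.975))   # very wide interval, reflecting the uncertainty
```
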
0 Answers