ARIMA parameters for a very frequent data set

Question

I have data gathered once every 5 minutes, and I'd like to fit a seasonal ARIMA(p, d, q)(P, D, Q)m model to it in order to forecast what might come next. I have one week of data at that 5-minute interval, so there are 2016 data points in the set.

The data looks like this:

The data is clearly seasonal by day. The problem is that a day contains 288 points, which seems like too many to use as the m parameter for ARIMA (288 = 24 * 12: 24 hours in a day, each hour containing 12 periods of 5 minutes). Is 288 the correct value for m? If so, what can be done about such a high value?
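The arithmetic above can be checked in a few lines. This is only a sketch of the bookkeeping, assuming a daily season and exactly seven full days of observations; the variable names are illustrative and not part of any library.

```python
from datetime import timedelta

# Sampling interval and observed span, as stated in the question.
sampling_interval = timedelta(minutes=5)
season_length = timedelta(days=1)   # assumption: the season is one day
observed_span = timedelta(days=7)

# Seasonal period m = number of observations in one seasonal cycle.
m = season_length // sampling_interval
n_points = observed_span // sampling_interval
n_cycles = n_points // m

print(m, n_points, n_cycles)  # prints: 288 2016 7
```

So the week contains only 7 complete seasonal cycles, which is worth keeping in mind when estimating seasonal terms.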

The autocorrelation plot, as a whole looks like this:

Now, as I understand from the web, a sinusoidal autocorrelation plot tells me that the nonseasonal part of the ARIMA model should look something like (p, d, 0).

The problem is that when I try to create the partial autocorrelation (PACF) plot like this (in Python, using the statsmodels.graphics.tsaplots module):

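from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(data, lags=50)  # partial autocorrelation up to lag 50
pyplot.show()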

it takes a very long time, presumably because the data is too frequent? This plot would allow me to choose the p and P parameters depending on where the spikes are (at the first few lags for the nonseasonal part, and at later lags for the seasonal part).
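For reference, the partial autocorrelations up to a fixed lag can be computed directly with the Durbin-Levinson recursion. The sketch below is dependency-free and is not a description of how statsmodels computes them; the helper names `sample_acf` and `pacf_durbin_levinson` are hypothetical, and the input is a synthetic AR(2) series rather than the data from the question. In this sketch the work grows with the number of lags requested, so asking for 50 lags on about 2000 points stays cheap.

```python
import random


def sample_acf(x, nlags):
    """Sample autocorrelations rho[0..nlags] (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(nlags + 1)
    ]


def pacf_durbin_levinson(x, nlags):
    """Partial autocorrelations for lags 0..nlags via Durbin-Levinson."""
    rho = sample_acf(x, nlags)
    pacf = [1.0]
    phi_prev = []  # phi_prev[j] holds phi_{k-1, j+1}
    for k in range(1, nlags + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the AR coefficients of order k from those of order k-1.
        phi_new = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_new.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi_new
    return pacf


# Synthetic AR(2) series (an assumption for illustration only):
# x_t = 0.5 * x_{t-1} + 0.3 * x_{t-2} + noise
random.seed(0)
x = [0.0, 0.0]
for _ in range(2016):
    x.append(0.5 * x[-1] + 0.3 * x[-2] + random.gauss(0.0, 1.0))
x = x[2:]

pacf_values = pacf_durbin_levinson(x, nlags=50)
for lag in range(1, 6):
    print(lag, round(pacf_values[lag], 3))
```

For an AR(2) process the theoretical PACF is nonzero at lags 1 and 2 and zero afterwards, which is the pattern one looks for when reading p off the plot.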

Still, the PACF plot for a single day looks like this:

plot_pacf(data.iloc[0:288])
pyplot.show()

Now this suggests a p of 2, since it is at lag 2 that I have the last peak.

But from this I can infer neither P nor D, since the plot covers just a single period. What should I do to be able to determine the P and D parameters from the complete 7-day data set?
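One common heuristic, which is not taken from this thread, is to look at the autocorrelation at multiples of the seasonal period (m, 2m, ...) on the full series, then apply one seasonal difference (D = 1) and look again. The sketch below does this on a synthetic series with a daily bump pattern and m = 288; the data, the helper names `seasonal_difference` and `autocorr`, and the noise level are all assumptions made for illustration.

```python
import math
import random

random.seed(1)
m = 288          # observations per day at a 5-minute interval
n = 7 * m        # one week of data

# Synthetic stand-in for the real data: a daily bump that is zero
# for half of the day, plus Gaussian noise.
series = [
    10.0 * max(0.0, math.sin(2 * math.pi * t / m)) + random.gauss(0.0, 1.0)
    for t in range(n)
]


def seasonal_difference(x, period):
    """Return x[t] - x[t - period], i.e. one seasonal difference (D = 1)."""
    return [x[t] - x[t - period] for t in range(period, len(x))]


def autocorr(x, lag):
    """Sample autocorrelation of x at a single lag (biased estimator)."""
    size = len(x)
    mean = sum(x) / size
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return sum(dev[t] * dev[t - lag] for t in range(lag, size)) / c0


diffed = seasonal_difference(series, m)

print("raw, lag m:       ", round(autocorr(series, m), 3))
print("differenced, m:   ", round(autocorr(diffed, m), 3))
print("differenced, 2m:  ", round(autocorr(diffed, 2 * m), 3))
```

A large positive autocorrelation at lag m on the raw series that disappears after differencing suggests D = 1; whatever structure remains at lags m and 2m in the differenced series is what the seasonal P and Q terms would need to capture. Note that each seasonal difference discards one full day, leaving 6 days here.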

why don't you post the actual data in single column csv file ? — IrishStat, Nov 24 '17 at 18:43
Looking at your data, it seems to be most important to understand why some of the non-zero periods are short and some are long (compare the second and the third bump), and why some are high and others low. ARIMA is not overly good at high frequency data, especially with seasonally inactive data. Consider a [tag:bats] or [tag:tbats] model. Here is an earlier thread with data that looks similar: [Explain the croston method of R](https://stats.stackexchange.com/q/127337/1352) (the title is misleading) — Stephan Kolassa, Nov 24 '17 at 18:55
@IrishStat, here's the CSV file with the timestamps and data columns: https://www.dropbox.com/s/gzusbfc7hc2cjlw/paul.csv?dl=0 — Paul, Nov 24 '17 at 20:45
@StephanKolassa, thanks, I will look over what you recommended me — Paul, Nov 24 '17 at 20:45
