
Below is a small subset of a larger set of measurements of a process that I am using, in turn, to predict something else. The part of the process that is my signal of interest is the random walk. I've posted the data in CSV format for those especially interested, but it is not necessary to look at it to answer my question.

I've fitted an ARIMA(1,1,2) model to my signal (after log transforming it). It was the best model by AIC/SBC selection, and the prediction is overlaid on the original (log-transformed) series below:

Predicted Overlaid on Actual

The residuals look like white noise to me (though I have not yet run a test for that): [figure: residual plot]

  • In general, how do I get the prediction at each time step for the random walk portion of the ARIMA model?
  • The outlier with a value around -5 is a bad data point and I'd like to exclude it. If it's not broadening the question too much I'd like to know how to exclude data points that fall outside of pre-determined limits during an online prediction.
  • I did notice my residuals show a change in variance, so if that violates some kind of ARIMA model assumptions or something let me know.
Jeffrey Girard
JPJ
  • I can't identify a clear question here. "Confirm my understanding" followed by an extensive list of things to check doesn't seem to fit the question-and-answer format (and in particular sounds likely to lead to the [non-answer](https://blog.stackoverflow.com/2011/01/real-questions-have-answers/) answer "yes, that's correct"). "I have no idea how to " isn't a question either. "Put me on a path" is basically "try to guess how to teach me ". Is it possible for you to rephrase this into one or more question posts that are actually questions-with-answers? – Glen_b Oct 27 '15 at 01:57
  • Interesting Q's but @Glen_b makes a good point. Ideas that might help rephrase the Q follow. ARIMA is a set of models. Box-Jenkins is a methodology, which requires (inducing) stationarity. State-space form (SSF) is a representation. E.g. all ARIMA models can be written in SSF. SSF is very flexible, does not require differencing, and an associated methodology is Harvey. Estimation of SS params can be done with KF via prediction error decomposition. IMO those are kind of the main aspects. Good luck! – Graeme Walsh Oct 27 '15 at 02:24
  • Thanks guys -- I admit I'm fighting that "I don't know what I don't know" situation here so I'll try to adjust my question given your feedback. – JPJ Oct 27 '15 at 04:27
  • Post your data set and your current model that the automated system built to dropbox.com and this can be used to benchmark the different methods. As for ignoring outliers, that is always a bad idea. – Tom Reilly Oct 27 '15 at 13:47
  • I will do that Tom. Honestly probably the best idea to help me get it! – JPJ Oct 27 '15 at 14:50
  • Why is this still on hold? The question has been edited per comments. – JPJ Oct 27 '15 at 19:34
  • 1. It's still too broad. It asks two quite different questions. Ask as two questions (by editing this and the second question you posted to each have one of the different question you asked about above). 2. There are some details you link to that should be directly accessible to someone trying to read the question. Put what can reasonably be in the question here in the question (a smaller subset of your data, for example, and any important parts of output) – Glen_b Oct 27 '15 at 22:13
  • I've cut out the second question above. It's still a fair way from being fixed though (see previous comment). Note that if you're asking "how do I use JMP" that question will probably close as [off topic](http://stats.stackexchange.com/help/on-topic). If you're asking "how do predictions work for random walks in ARIMA" that's on-topic, so you'll also need to clarify that. – Glen_b Oct 27 '15 at 22:19
  • I couldn't get your pdf to open or download. Please put the most critical parts into your question. – Glen_b Oct 27 '15 at 22:48
  • I'd suggest a time series plot at the least should be placed in the question. – Glen_b Oct 27 '15 at 23:26
  • @Glen_b will do -- and regarding JMP -- no I didn't want any input regarding JMP software, I just wanted to make it clear where I was getting the output from. – JPJ Oct 28 '15 at 00:11
  • I added a time series plot myself (which you should replace with whatever relevant information you have, including your own such plots); there are a couple of zeroes in that series, but even ignoring those, the logs of the rest of the series are somewhat skew. – Glen_b Oct 28 '15 at 00:33
  • @Glen_b oops I edited right after you so never saw the plot. Hopefully my edited version is getting closer to typical scope and format on these forums! – JPJ Oct 28 '15 at 00:38
  • There's no problem. How did you deal with the zeroes when you took logs? – Glen_b Oct 28 '15 at 00:40
  • @Glen_b shoot, JMP just ignored them and I didn't catch it! That's the problem with using the higher level software packages. But I guess for discussion sake I would treat those the same as any other outlier since I know those are bad values in this case. – JPJ Oct 28 '15 at 00:48
  • You might want a couple of introductory sentences at the start of your post describing what your data are and what you're trying to achieve with this model. Anyway, I'm reopening on the basis that it's no longer too broad, but it may still run into trouble on other criteria (requiring a few additional edits). – Glen_b Oct 28 '15 at 00:59
  • @Glen_b Thanks so much for your advice/help on my first couple posts!! – JPJ Oct 28 '15 at 14:53
  • Not sure if this helps you, but the Variance is increasing at period 1833. If you use Tsay's variance test to use WLS you can remedy that www.unc.edu/~jbhill/tsay.pdf There is also a level shift up at period 4774+(1.01) that can be modeled with a dummy variable. There are about 30 outliers above 10 units and 60 others of note. The model we built was an AR3 with lags 1,2,3 all significant and no need for logs. – Tom Reilly Oct 28 '15 at 15:26
  • the residual variance is clearly growing, so log transform didn't work – Aksakal Oct 28 '15 at 15:29
  • @TomReilly I'm still looking at the reference but it looks to be helpful from first glance. Forgive my ignorance if this is obvious to more knowledgeable folks -- but how does your AR3 model account for the random walk? This signal is a shorter version of continuous measurements that "walk" all over.... – JPJ Oct 28 '15 at 18:21
  • @Aksakal the residual variance grows here, but over a longer time period it can be seen that it will grow/shrink rather randomly (actually it is caused by something else for which I am also measuring and have values of). However I'm not sure if that matters to my goal of trying to predict the random walk part, or if I need to pull that into the model somehow to generalize my model and avoid overfitting to this particular realization? – JPJ Oct 28 '15 at 18:53
  • You're fitting a constant variance model to a non-constant variance process. The implications could be various depending on the goal of forecasting. For instance, in a log-difference model the variance impacts the expected drift, as you probably know, so your forecast could become biased. Your estimates of the parameter covariance matrix will be messed up; it may or may not matter, depending on what your goal is. – Aksakal Oct 28 '15 at 20:09
  • Thanks -- I guess you're kind of confirming then that my fitted model does have a constant variance assumption and I need to explore options to account for the heteroskedasticity. – JPJ Oct 28 '15 at 22:14
  • See http://stats.stackexchange.com/questions/18844/when-and-why-to-take-the-log-of-a-distribution-of-numbers for when one needs to take logs. As you said, you need to find a more appropriate solution. Here is a reference for dealing with deterministic change points in variance http://www.unc.edu/~jbhill/tsay.pdf which we have fully implemented for both ARIMA and Transfer Function models. It speaks to the idea of Generalized Least Squares via weighted estimation. Tsay didn't specifically say this but the idea of determining and using weights is quite clear from his seminal paper. – IrishStat Oct 29 '15 at 10:45
  • Well ... what did the Professor say ? Did he say " one way to learn a subject is to try and teach it" ? or more precisely "try and answer a question" ? – IrishStat Dec 12 '15 at 18:52
  • I've come a long ways since I first asked this question. I ended up modeling the process as a random walk + AR2 + AR1 in state-space form and it worked quite well. I decomposed the process using FIR filters to give me the RW, Oscillations, and left-over noise. This worked well because there wasn't much spectral overlap between these things. After that I optimized the process driving noises for the RW and AR2 process by fitting LS to the outputs from the FIR filters. I now think that this question might be more appropriate in signal processing forums... – JPJ Dec 13 '15 at 19:18

2 Answers


Thanks to the help on this forum, I was also able to put this consolidated form of the original question to one of the profs at my university who is teaching a time-series class this semester... which I should have taken :-)

I thought his answer was pretty good, and it also echoes some of the other comments posted on this question.

I'm not so sure I should accept this answer officially...but at least want to share it here.

Prof Answer:

You have some strange patterns in your data… it looks like there is some type of “structural break” (the form or pattern of the time process changes around time point 3000). It might make sense to break the time series into two pieces and analyze each separately, if everything else is not working so well.

If covariances are indeed changing over time, then you would be violating the assumptions of the Box-Jenkins model.

I would try some intermediate steps. First just difference your time series: compute Y_t = X_t - X_{t-1} (after log transformation) and see if it looks like there’s some similar pattern throughout. If so, then you could try fitting an ARMA model to the differences. Are you using a software package that computes ARMA models? Can you fit such models on the differences Y_t?
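As a rough illustration of that first step, here is a minimal sketch in plain Python. The series `x` is made-up data, not the poster's measurements, and `half_spreads` is just a hypothetical helper for a crude "does the pattern look similar throughout?" check: it compares the spread of the differences in the first and second halves of the series.

```python
import math
import statistics

def log_differences(x):
    """Return Y_t = log(X_t) - log(X_{t-1}) for t = 1, ..., n-1.

    Note: math.log raises ValueError for zero or negative values, so any
    zeroes in the raw series need to be handled before this step.
    """
    logs = [math.log(v) for v in x]
    return [cur - prev for prev, cur in zip(logs, logs[1:])]

def half_spreads(y):
    """Standard deviation of the first and second halves of y (a crude check)."""
    mid = len(y) // 2
    return statistics.stdev(y[:mid]), statistics.stdev(y[mid:])

# Hypothetical positive-valued series whose swings grow in the second half.
x = [10.0, 10.5, 10.2, 10.8, 10.6, 12.0, 9.5, 13.0, 9.0, 14.0]
y = log_differences(x)
first_sd, second_sd = half_spreads(y)
print(first_sd, second_sd)
```

A large gap between the two spreads would be one informal hint that the differenced series does not behave the same way throughout, which is the situation where splitting the series might be worth trying.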

Sometimes software packages will compute the k-step forecasts for the future… if you had those, then you could undo the differencing to get your predictions on-line. So for example, you get predictions in the future for Y_{n+1},…,Y_{n+k} and you know Y_{n+i} = X_{n+i} - X_{n+i-1}, so you can find X_{n+i} = Y_{n+i} + X_{n+i-1} for i=1,…,k from the predictions for the differences Y_{n+i} and the observation you have for X_n. Otherwise, you might have to resort to Kalman filtering, which can be OK too.
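The recursion X_{n+i} = Y_{n+i} + X_{n+i-1} is just a running sum started at the last observed value. A minimal sketch, assuming the forecasts of the differences are already available from some package (the numbers in `y_hat` and the last observation are invented for illustration):

```python
import math

def undifference(last_level, diff_forecasts):
    """Turn forecasts of differences Y_{n+1..n+k} into forecasts of levels X_{n+1..n+k}.

    last_level is the observed X_n (here on the log scale).
    """
    levels = []
    level = last_level
    for d in diff_forecasts:
        level = level + d          # X_{n+i} = Y_{n+i} + X_{n+i-1}
        levels.append(level)
    return levels

# Hypothetical inputs: last observation on the original scale and
# three forecasts of the log-differences.
x_n = math.log(12.0)
y_hat = [0.02, 0.01, -0.005]

log_forecasts = undifference(x_n, y_hat)
forecasts = [math.exp(v) for v in log_forecasts]   # back to the original scale
print(forecasts)
```

One caveat on the last line: exponentiating a forecast made on the log scale does not include any variance correction, which is the drift issue raised in the comments on the question.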

If you have any outliers you could just delete them in fitting the model. Essentially, you would have no information at that time point, but that’s better than misleading information.
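For the on-line case, "no information at that time point" can be sketched with a Kalman filter that simply skips the measurement update when a value falls outside the pre-determined limits. Note the swap: this uses a local-level model (random walk plus observation noise), not the fitted ARIMA(1,1,2), and the limits, noise variances `q` and `r`, and starting values are all assumptions chosen for illustration.

```python
def filter_random_walk(observations, lower, upper, q, r, level0, p0):
    """One-step-ahead predictions of the level in a random walk + noise model.

    Observations that are None or outside [lower, upper] are treated as
    missing: the prediction is still produced, but the update is skipped.
    q is the random-walk (process) variance, r the observation variance.
    """
    level, p = level0, p0
    predictions = []
    for z in observations:
        # Predict: a random walk carries the level forward unchanged,
        # while its uncertainty grows by q.
        p = p + q
        predictions.append(level)
        if z is None or not (lower <= z <= upper):
            continue               # bad data point: no measurement update
        # Update: blend the prediction with the accepted observation.
        gain = p / (p + r)
        level = level + gain * (z - level)
        p = (1.0 - gain) * p
    return predictions

# Hypothetical log-scale observations with one bad point at -5.
obs = [1.0, 1.1, -5.0, 1.2]
preds = filter_random_walk(obs, lower=0.0, upper=3.0, q=0.01, r=0.1,
                           level0=1.0, p0=1.0)
print(preds)
```

Because the bad point never enters the update, the prediction for the step after it equals the prediction for the step before it; only the uncertainty `p` keeps growing until the next accepted observation arrives.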

I guess that I just wonder (based on your initial graph) about how the fit of the time series model might improve if you considered breaking the time series into two pieces. Maybe you would do a better job of capturing extremes in the 2nd portion of the series.

JPJ

"I did notice my residuals show a change in variance, so if that violates some kind of ARIMA model assumptions or something let me know."

Your residuals suggest non-constant error variance, so you should employ a Generalized Least Squares (GLS) model as suggested by http://www.unc.edu/~jbhill/tsay.pdf. Your selection of a log transform is probably unwarranted. AUTOBOX, a piece of software that I have helped develop, can seamlessly put together a solution that includes a minimally sufficient ARIMA model, level shifts/local time trends, outliers and the GLS that you apparently need. A second thought is that perhaps your ARIMA coefficients are time-varying, which can also induce/create the appearance of non-constant errors. This facility/capability is also available.
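To make the idea of weighted estimation concrete, here is a minimal sketch. It is not Tsay's procedure and not what AUTOBOX does; it only illustrates the general idea of down-weighting a high-variance regime. It assumes a single known variance change point, a plain AR(1) without intercept, and made-up data. The helper names are hypothetical.

```python
import statistics

def regime_weights(residuals, change_point):
    """Weight each time point by 1 / (residual variance of its regime)."""
    var_before = statistics.pvariance(residuals[:change_point])
    var_after = statistics.pvariance(residuals[change_point:])
    n_after = len(residuals) - change_point
    return [1.0 / var_before] * change_point + [1.0 / var_after] * n_after

def weighted_ar1(y, weights):
    """Weighted least squares estimate of phi in y_t = phi * y_{t-1} + e_t.

    weights[t] applies to the equation for time t (t >= 1); weights[0] is unused.
    """
    num = sum(w * cur * prev for w, prev, cur in zip(weights[1:], y, y[1:]))
    den = sum(w * prev * prev for w, prev in zip(weights[1:], y))
    return num / den

# Hypothetical residuals from a first-pass fit: quiet, then noisy after t = 4.
residuals = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
weights = regime_weights(residuals, change_point=4)

# Hypothetical series that decays exactly by a factor of 0.5 each step.
y = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625]
phi_hat = weighted_ar1(y, weights)
print(phi_hat)
```

In practice the change point would have to be detected rather than assumed, and the weights and model would typically be re-estimated in turn.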

If you share this with your Professor he may have something to say about Tsay.

IrishStat
  • Recommending fitting differences is a very dangerous option, as is assuming any other form of filter (like power transformations). Data such as yours can reflect a myriad of causes based upon the symptoms that you observe (compute). I would wager that you have a time series with a deterministic change in error variance, strongly suggesting GLS (weighted regression/ARIMA) with a number of pulses. Since most software packages don't have this feature, most analysts (even professors) have no experience in this regard. – IrishStat Nov 17 '16 at 14:30