I am trying to fit a time series model to the following data. It seems to be seasonal. Would an ARIMA model be good?

[Image: plot of the data]

Here is the data:

Count

2 1 4 5 4 8 7 11 4 4 11 7 10 7 0 19 13 13 11 9 8 16 10 12 9 7 21 9 10 6 7 19 18 9 19 15 14 17 9 10 10 13 15 20 15 12 15 16

The numbers are separated by spaces.
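A quick way to check the "seems to be seasonal" impression before choosing a model is to look at the sample autocorrelations. The following is a minimal sketch in base R using the 48 counts above; the seasonal period of 12 is an assumption made only for illustration (the question does not state the sampling frequency), and the white-noise band is the usual rough approximation.

```r
# Counts copied from the question (48 values).
x <- c(2, 1, 4, 5, 4, 8, 7, 11, 4, 4, 11, 7,
       10, 7, 0, 19, 13, 13, 11, 9, 8, 16, 10, 12,
       9, 7, 21, 9, 10, 6, 7, 19, 18, 9, 19, 15,
       14, 17, 9, 10, 10, 13, 15, 20, 15, 12, 15, 16)

# Sample autocorrelations up to lag 24, without drawing the plot.
ac  <- acf(x, lag.max = 24, plot = FALSE)
rho <- drop(ac$acf)[-1]          # drop lag 0, which is always 1

# Approximate 95% band under the white-noise null hypothesis.
band <- 1.96 / sqrt(length(x))

# Lags whose autocorrelation falls outside the band.
which(abs(rho) > band)

# Autocorrelation at the assumed seasonal lag (hypothetical period of 12).
period <- 12
rho[period]
```

A seasonal series would typically show spikes at multiples of the period; a level shift or trend tends to show up instead as slowly decaying autocorrelations at short lags.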

Damien
  • The data do not appear to be the same as the plot. (Where is the 42 in the data on the plot? Where is the value of (28,0) on the plot shown in the data?) – whuber Jul 23 '12 at 18:47
  • @whuber: The data starts at t=12. – Damien Jul 23 '12 at 18:54
  • Usually time is the abscissa and the observed value is the ordinate. You have them switched, which I think is what confused whuber. – Michael R. Chernick Jul 23 '12 at 19:45
  • @MichaelChernick: Isn't my data correct if we assume the x-axis is time? – Damien Jul 23 '12 at 20:21
  • @Damien Yes I realized that and mentioned it in my answer. I think you have fixed the labelling now. – Michael R. Chernick Jul 23 '12 at 20:30
  • @Michael, please plot the data (or look at the plots produced in the answers) and compare that to the plot in the question: they are different. – whuber Jul 23 '12 at 20:34
  • @whuber: I edited the data – Damien Jul 23 '12 at 20:40
  • Damien, editing the data was a bad idea, because you have already received several detailed responses that used the data you originally posted. It's unfair of you in effect to pull the rug out from under those who have gone to that work to help you. – whuber Jul 23 '12 at 20:43
  • @whuber I have realized that the labelling of the axes was probably the problem, and I think Damien has fixed it. – Michael R. Chernick Jul 23 '12 at 20:43

3 Answers

  1. Delete the leading zeroes, as they can inflate the autocorrelation function.
  2. A visual inspection suggests a possible level shift followed by a slight upward trend.
  3. There are a few anomalies (pulses), maybe just one.
  4. There is no apparent seasonal structure.

An ARIMA model would be fine, as long as the points above are taken into account.

If you want to post the data, I will be more specific about the applicability of ARIMA.

The 114 values you posted are quite different from your original plot. The actual-fit-forecast plot is [image]. The ACF of the original series shows little structure [image]. The "best model" contains no ARIMA structure but shows a few unusual data points and three distinct means or groups (observations 1-32, 33-69, and 70-114) [image], with outliers [image]. The ACF of the residuals suggests randomness [image]. What we have here are three ARIMA models of the form (0,0,0)(0,0,0) with three different means or regimes: XBAR1 = 8.0, XBAR2 = 14.826, and XBAR3 = 10.8572. One could consider this a single-dimension cluster analysis (see "Univariate clustering of time series").
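The "distinct means" idea can be sketched in base R as a piecewise-constant fit followed by a check of the residual autocorrelations. This is only an illustration, not the AUTOBOX procedure: it uses the 48-value series from the question rather than the 114 values analysed above, the breakpoint `brk` is a hypothetical value chosen by eye, and no outlier adjustment is made, so the numbers will not match the XBAR values quoted.

```r
# 48 counts from the question (not the 114-value series analysed above).
x <- c(2, 1, 4, 5, 4, 8, 7, 11, 4, 4, 11, 7,
       10, 7, 0, 19, 13, 13, 11, 9, 8, 16, 10, 12,
       9, 7, 21, 9, 10, 6, 7, 19, 18, 9, 19, 15,
       14, 17, 9, 10, 10, 13, 15, 20, 15, 12, 15, 16)

# Hypothetical breakpoint, chosen by eye for illustration only.
brk <- 15
regime <- factor(ifelse(seq_along(x) <= brk, "before", "after"),
                 levels = c("before", "after"))

# Level of each regime: the mean, and the median as a more outlier-resistant check.
level_mean   <- tapply(x, regime, mean)
level_median <- tapply(x, regime, median)

# Residuals from the piecewise-constant fit (ave() returns each point's group mean).
res <- x - ave(x, regime)

# If the level shift explains the structure, these should look like white noise.
res_acf <- acf(res, lag.max = 12, plot = FALSE)

level_mean
level_median
```

If the residual autocorrelations stay inside the usual white-noise band, a model with regime-specific means and no ARMA terms is at least not contradicted by the data.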

IrishStat
  • Can I email you the data? – Damien Jul 23 '12 at 18:23
  • Damien, because that kind of personal communication circumvents the purpose and mechanisms of this site it is *strongly discouraged.* – whuber Jul 23 '12 at 18:36
  • @Damien: Consider providing a public Dropbox (or equivalent) link instead. – cardinal Jul 23 '12 at 18:49
  • @IrishStat I see from your plot that whuber was right. His data look nothing like the original plot, even if the labelling was corrected. It does look like two level shifts with no apparent trend at all. There is one very distinctive outlier in the first portion of the series and possibly another after the second shift. My suggestions wouldn't work for this plot; they only pertained as possibilities to the original plot. If there was an issue with the private communication, I think it is resolved, as you have exhibited the data and your modeling of it very nicely for CV. – Michael R. Chernick Jul 24 '12 at 00:26
  • @IrishStat It looks to me that the level shifts explain a lot of the variation. The remaining problems are the outliers and the job for the OP of coming up with a sensible explanation for the apparent behavior. – Michael R. Chernick Jul 24 '12 at 00:28
  • The data are counts, and ARMA models are not a good choice for analyzing them (even after transforming the data). It's better to use generalized ARMA (GLARMA) models. – hbaghishani Jul 24 '12 at 11:24
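For readers who want to treat the series as counts without going as far as GLARMA, a Poisson regression on a smooth function of time is a simpler stand-in (it has no ARMA component, so it is not what the comment above recommends). The sketch below uses the 48 counts from the question; the choice of a natural spline with `df = 3` is an arbitrary assumption, and `time_idx` is just an illustrative variable name.

```r
library(splines)  # ns(); ships with the standard R distribution

# 48 counts from the question.
y <- c(2, 1, 4, 5, 4, 8, 7, 11, 4, 4, 11, 7,
       10, 7, 0, 19, 13, 13, 11, 9, 8, 16, 10, 12,
       9, 7, 21, 9, 10, 6, 7, 19, 18, 9, 19, 15,
       14, 17, 9, 10, 10, 13, 15, 20, 15, 12, 15, 16)
time_idx <- seq_along(y)

# Poisson GLM with a smooth trend in time (df = 3 is an arbitrary choice).
fit <- glm(y ~ ns(time_idx, df = 3), family = poisson)

# Rough overdispersion check: values well above 1 suggest that a
# quasi-Poisson or negative binomial model may be more appropriate.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Fitted mean one step past the end of the series (spline extrapolation,
# so treat it with caution).
next_mean <- predict(fit,
                     newdata = data.frame(time_idx = length(y) + 1),
                     type = "response")

dispersion
next_mean
```

Serial dependence in the residuals of such a fit would be the motivation for moving to a model with explicit autoregressive terms for counts.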

I'm not sure the data you added to your post are the same as the data you used to make the plot. At any rate, it doesn't really matter, since we're trying to help with the underlying methodological aspect of the problem.

From the information we have, I would advise a simple median filter:

The idea is to circumvent the model-fitting procedure as much as possible, since we don't have enough information (or, IMHO, enough data points) to build a complicated model.

Edit: Following whuber's suggestion, I've taken the square root transformation to symmetrize the residuals.

Looking at the outliers, I don't really see any seasonality. Below, for illustration, I carry out the analysis using R, the open source statistical software:

library("robfilter")
dta<-c(2, 1, 4, 5, 4, 8, 7, 11, 4, 4, 11, 7, 10, 7, 42, 19, 13, 13, 11, 9, 8, 16, 10, 12, 9, 7, 21, 9, 10, 6, 7, 19, 18, 9, 19 ,15, 14, 17, 9, 10 ,10, 13, 15, 20, 15, 12, 15, 16 ,20, 17, 21 ,19, 8, 16, 11, 12, 16, 10, 5, 18, 13, 18, 16, 7, 12, 12, 17, 17, 7, 14, 15 ,10, 13, 15, 11, 13, 10, 9, 11, 11 ,10, 8, 24, 13, 18, 8, 8 ,13, 9 ,7, 6, 14, 17 ,7, 13, 9, 11, 19, 8 ,9, 13, 11, 14, 5, 8, 8, 13, 12 ,20, 9, 18 ,13, 13, 10 ,6 ,9, 8, 8)
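# One-sided (online) median filter on the square-root scale, window of 12
mod4a <- robreg.filter(y = sqrt(dta), width = 12, method = "MED", h = 7, minNonNAs = 5, online = TRUE, extrapolate = FALSE)
# Absolute residuals; the first 11 points have no fitted level, so sqrt(dta[1]) is used there
resds <- abs(c(rep(sqrt(dta[1]), 11), na.omit(mod4a$level[, 1])) - sqrt(dta))
# Local scale: the same filter applied to the absolute residuals
mod4b <- robreg.filter(y = resds, width = 12, method = "MED", h = 7, minNonNAs = 5, online = TRUE, extrapolate = FALSE)
# Times of the outliers: residuals more than 3 times the local scale
otl <- which(resds / mod4b$level[, 1] > 3)
otl
# [1]  15  32  53  59  83  85 104 109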

fit of the series, with outliers marked in green
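To make the idea of a one-sided median filter concrete, here is a minimal base-R sketch that computes the median of the most recent 12 square-rooted values at each time point. It is an approximation written for illustration, not the `robfilter` implementation, whose handling of options such as `h` and `minNonNAs` may differ. To keep it short it uses the 48-value series from the question rather than the longer `dta` vector above.

```r
# 48 counts from the question (shorter than the series used in the answer).
x <- c(2, 1, 4, 5, 4, 8, 7, 11, 4, 4, 11, 7,
       10, 7, 0, 19, 13, 13, 11, 9, 8, 16, 10, 12,
       9, 7, 21, 9, 10, 6, 7, 19, 18, 9, 19, 15,
       14, 17, 9, 10, 10, 13, 15, 20, 15, 12, 15, 16)

z     <- sqrt(x)       # work on the square-root scale, as in the answer
width <- 12            # window length
n     <- length(z)

# Trailing median: the level at time i uses only observations i-width+1, ..., i.
level <- rep(NA_real_, n)
for (i in width:n) {
  level[i] <- median(z[(i - width + 1):i])
}

# Absolute residuals on the square-root scale (NA for the first width-1 points).
abs_res <- abs(z - level)

# One-step-ahead forecast: the last level, back-transformed to the count scale.
forecast_next <- level[n]^2
forecast_next
```

Because each level depends only on past and current observations, the last level can be carried forward as a forecast, which a two-sided (centered) window would not allow.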

user603
  • This is the right idea, because (a) there is a trend but it's not easily characterized and (b) there are no significant serial correlations at any lag. However, loess will do a much better job than a median filter at *characterizing* these data. All this begs the question of *why* the OP is fitting the data: median filters or loess will do little for predicting future values, for instance. – whuber Jul 23 '12 at 20:33
  • @whuber: This is a one-sided filter: as far as I understood, the option "online" makes sure it doesn't use data from $t+i$, $i>0$, at time $t$. More generally, I agree with you: I also thought of asking the OP what the end purpose was (is he, for example, interested in the value of an AR coefficient for a given lag?). – user603 Jul 23 '12 at 20:38
  • @user603: I want to predict data – Damien Jul 23 '12 at 20:41
  • Good point about the potential online nature (+1). Another mild improvement can be achieved by analyzing the square roots of the data (because these evidently are counts). Alternatively, for sophisticated analysts, a Poisson GLM with splines or changepoints would do a fine job. – whuber Jul 23 '12 at 20:41
  • @Damien: Yes, as long as you use "online=TRUE" this approach can be used for forecasting (we only use the past). The final forecast for the next period is 11, not very different from IrishStat's forecast. – user603 Jul 23 '12 at 21:00
  • @Damien The AUTOBOX forecast is 10.8572 ( the robust/outlier adjusted mean of the last 44 values ) – IrishStat Jul 23 '12 at 21:16
  • @user603: Where is the final forecast for the next period? – Damien Jul 23 '12 at 21:34
  • @Damien: It's the last entry of mod4a$level[,1] raised to the power 2, since this is based on a model for the square root of your data. – user603 Jul 23 '12 at 21:37
  • @user603: Also I don't see how i have >100 data points. I only had 37 data points – Damien Jul 23 '12 at 21:44
  • @user603: Sorry I realized that you were working off of the old data which had more data. – Damien Jul 23 '12 at 21:47
  • @user603: This has 47 and I just used your example for it. Thanks: 2 1 4 5 4 8 7 11 4 4 11 7 10 7 0 19 13 13 11 9 8 16 10 12 9 7 21 9 10 6 7 19 18 9 19 15 14 17 9 10 10 13 15 20 15 12 15 16 – Damien Jul 23 '12 at 21:47
  • It had 48, and the next-period forecast would be 14.5 :). But you can do the whole analysis for yourself: R is open source and free, and so is the robfilter library; furthermore, the methodology is fully explained in the peer-reviewed papers referenced in the package's documentation. In a word, it's not a black box and you are encouraged to play with it. – user603 Jul 23 '12 at 21:51
  • @user603: It seems that all the values in mod4a$level[,1] are the square roots of the forecasted values up to the last data point we have. But if we wanted to extrapolate, could we just change to extrapolate = TRUE to get the next prediction? – Damien Jul 23 '12 at 21:55
  • @Damien: No: extrapolate=TRUE only concerns the data for which we don't have a model. Since online is TRUE, the extrapolation only affects the first 11 observations, for which we don't have a model and which would otherwise be coded as NA. – user603 Jul 23 '12 at 21:59
  • @user603: How would we get the next predicted observation after the last observed one? What about the next one after that....etc...? 14.5 is just the predicted value at t=48. But we already have observed this value. What about the predicted value at t=49? – Damien Jul 23 '12 at 22:01
  • @Damien: As with all models that allow level shifts, there is no simple linear expression for $y_{t+k}|y_t$: you have to recursively feed the forecast back into the model to get a new forecast. After 12 periods the forecast will be a constant, because the model only uses the last 12 observations to build a forecast. This is the "width" parameter. But again, this is pretty simple to do in R from the code I posted (it's just a loop). – user603 Jul 23 '12 at 22:05
  • @user603: The model uses 1-12 to get 13, 2-13 to get 14, etc...? – Damien Jul 23 '12 at 22:22
  • Yes. But again, you are encouraged to go read the references quoted in the [manual](http://cran.r-project.org/web/packages/robfilter/robfilter.pdf) – user603 Jul 23 '12 at 22:32
  • @user603: Would you say median filters are good to detect any violations of apparent trends in a dataset? – Damien Jul 23 '12 at 23:14
  • @Damien: I'm not even sure what the question means. The important property of the recursive median in this context is that it budges very little [(minimally among all estimators of central tendency, in fact)](http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aoms/1177703732) when up to width/2-1 of the last width observations have been replaced by arbitrary data. – user603 Jul 23 '12 at 23:22
  • @user603: So we are using 36 windows of size 12 to get the forecasts. To get the projected values for the next 10 time periods, we would need to use 46 windows? – Damien Jul 23 '12 at 23:46
  • No, but I think you are confused about basic time series concepts. Try looking up "moving window estimation". Otherwise, ask a separate question. – user603 Jul 23 '12 at 23:48
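The recursive forecasting loop described in the comments above can be sketched in base R as follows. This is an illustration under stated assumptions rather than the `robfilter` code: each forecast is taken to be the median of the last 12 values on the square-root scale, the forecast is appended to the series, and the window moves forward by one. The function name `forecast_median` and its arguments are hypothetical, and the data are the 48 counts from the question.

```r
# 48 counts from the question.
x <- c(2, 1, 4, 5, 4, 8, 7, 11, 4, 4, 11, 7,
       10, 7, 0, 19, 13, 13, 11, 9, 8, 16, 10, 12,
       9, 7, 21, 9, 10, 6, 7, 19, 18, 9, 19, 15,
       14, 17, 9, 10, 10, 13, 15, 20, 15, 12, 15, 16)

# Hypothetical helper: multi-step forecasts from a trailing median filter.
forecast_median <- function(x, horizon, width = 12) {
  z  <- sqrt(x)                      # model the square roots of the counts
  fc <- numeric(horizon)
  for (k in seq_len(horizon)) {
    fc[k] <- median(tail(z, width))  # forecast = median of the last `width` values
    z     <- c(z, fc[k])             # feed the forecast back in as if observed
  }
  fc^2                               # back-transform to the count scale
}

fc <- forecast_median(x, horizon = 15)
round(fc, 2)
```

The first element is the one-step-ahead forecast (close to the 14.5 quoted above for the 48-value series); later elements depend increasingly on earlier forecasts rather than on data, so they carry little information beyond the most recent estimated level.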

It would also help if you stored your data as a time series, for example:

1. Make an R time series out of the raw data, specifying the frequency and start date:

gIIP <- ts(Trimmer, frequency=12, start=c(2005,11))
print(gIIP)
plot.ts(gIIP, type="l", col="blue", ylab="Title of Chart", lwd=2, main="Full data")
grid()
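Once the data are in a `ts` object, the seasonal position of each observation is available through `cycle()`, which gives a simple way to compare averages by position within the period. The sketch below applies this to the 48 counts from the question; the monthly frequency and the January 2010 start date are hypothetical placeholders, since the thread does not state the actual calendar.

```r
# 48 counts from the question.
counts <- c(2, 1, 4, 5, 4, 8, 7, 11, 4, 4, 11, 7,
            10, 7, 0, 19, 13, 13, 11, 9, 8, 16, 10, 12,
            9, 7, 21, 9, 10, 6, 7, 19, 18, 9, 19, 15,
            14, 17, 9, 10, 10, 13, 15, 20, 15, 12, 15, 16)

# Hypothetical calendar: monthly data starting January 2010 (not from the thread).
y <- ts(counts, frequency = 12, start = c(2010, 1))

# Average count by position within the assumed 12-period cycle.
month_means <- tapply(as.numeric(y), as.integer(cycle(y)), mean)
round(month_means, 2)
```

Large, consistent differences between these averages would point toward seasonality; with only four observations per position and an apparent level shift in the series, any differences here should be read cautiously.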

Jim Johnson
  • This seems to be code declaring a time series and plotting it. That's already been done. – Nick Cox Feb 10 '15 at 17:43
  • Operating on time-series (ts) objects in R is useful advice, indeed, but this answer doesn't address the OP's question(s) at all. – Graeme Walsh Feb 10 '15 at 18:53