I've got a small website and I'm putting a lot of effort into it. The traffic is growing but is still very low. I studied engineering, but my knowledge of statistics is basic.

I have put the last 70 days of data in a Google Doc (link) and I wish to detect daily seasonality, level shifts, and time trends in a logical way, while also identifying unusual readings so that they do not distort my forecast. What method should I use?

I've tried Excel with a linear, logarithmic regression. Are these forecasts statistically reliable?

What are the alternatives for this kind of task (e.g. another type of regression, or time series methods)?

enter image description here

Revious
  • It's not at all clear what you're asking in your last sentence. – Glen_b Mar 12 '15 at 23:00
  • @Glen_b: I've tried to clarify the question, can you provide me any suggestion in order to make it even more clear? – Revious Mar 13 '15 at 09:07
  • 2
    There are many books entirely about introductions to forecasting, so your question would have to be (a) new to this forum (have you looked for threads on forecasting?) (b) much more specific to have much chance of acceptance. – Nick Cox Mar 13 '15 at 09:28
  • 2
    It's not remotely clear what you're asking. For example I have no idea what "statistically affordable" might mean. – Glen_b Mar 13 '15 at 09:44
  • @NickCox: yes, I've googled and found 2 questions, both of which lack answers. But on a personal level it's hard for me to understand why people dislike beginners' questions. If a question is well formulated and leads to an information gain (http://en.wikipedia.org/wiki/Information_gain_in_decision_trees), why is it hard for professionals to accept it? – Revious Mar 13 '15 at 10:54
  • @Glen_b: sorry, I'm not a native speaker and I got the word completely wrong: a false friend. I meant whether they are trustworthy, reliable. – Revious Mar 13 '15 at 10:56
  • I've tried to further clarify and improve the question. @NickCox: if you can suggest not a book but a single article, I will be happy to read it. But I still think that helping beginners is something positive! – Revious Mar 13 '15 at 11:03
  • 3
    It's still not completely clear what you're asking, I'm afraid. But your comments to Nick are (a) inappropriate and (b) factually wrong. His comments have nothing to do with the attitude of experts to beginners questions, but to do with stackexchange policy. (Well-formulated beginners questions generally fare very well here, but such questions are rare. Nick in particular puts in a great deal of effort to help beginners to improve their questions.) ...(ctd) – Glen_b Mar 13 '15 at 11:11
  • 2
    (ctd)... His suggestion to look at beginners books on forecasting is reasonable because your question was very badly formulated -- some more information would help you understand *how* to formulate your question better. (I'd suggest Rob Hyndman's short online book myself. https://www.otexts.org/fpp) – Glen_b Mar 13 '15 at 11:11
  • 2
    Your question is better than it was; I've made some edits, but I think it still needs some improvement. You might break it into several questions -- that may help. For example, one question could be to ask about risks of relying on forecasts based on fitting a linear trend (a regression on time), such as you depict. A second question about forecasting a series like the one that you show might still close as too broad, but might be okay if worded to invite relatively specific answers. A quick look at a text on forecasting could help focus that sort of question to be more specific. – Glen_b Mar 13 '15 at 11:25
  • 1
    The OP is admittedly not a native English speaker and, as he states, not a statistics speaker. I suggest that he simply rephrase his question in the following way: I have 70 days of data and wish to detect daily seasonality, level shifts, and time trends in a logical way, while also identifying unusual readings and not having them distort my forecast. What method should I use? – IrishStat Mar 13 '15 at 11:30
  • 1
    On the 'fitting a trend' question, [this page](https://www.otexts.org/fpp/4/8) - especially the section on spurious regression, as well as [this question](http://stats.stackexchange.com/questions/133155/how-to-use-pearson-correlation-correctly-with-time-series/) and [this one](http://stats.stackexchange.com/questions/5173/resources-for-learning-about-spurious-time-series-regression) might be good places to start reading. – Glen_b Mar 13 '15 at 11:36
  • 1
    @Glen_b All three of your citations deal with causal modelling, while his problem is univariate, i.e. there are no user-specified predictors. Thus, in my opinion, they are not as relevant as you might have suggested. – IrishStat Mar 13 '15 at 11:42
  • On the 'fitting a trend' question, [this page](https://www.otexts.org/fpp/4/8) - especially the sections on linear trend and on spurious regressions, might be good places to start reading. – Glen_b Mar 13 '15 at 11:42
  • @IrishStat Sorry, I deleted that before I saw you had responded. Your comment still relates to the edited version I posted after your comment though. I think given that the OP is considering regression as a way of predicting nonstationary time series, it's definitely worth raising. – Glen_b Mar 13 '15 at 11:43
  • 1
    Revious -- there's some relevant discussion [here](https://en.wikipedia.org/wiki/Unit_root) as well, relating to the distinction between deterministic trend (such as you fitted) and a different sort of possibility that would be handled with different models. IrishStat's suggested wording would be one way to word what I was suggesting earlier would be a second question. – Glen_b Mar 13 '15 at 11:55
  • 3
    @Glen_b has lucidly explained our attitudes here, so I need not expand much on that. I choose not to take offence at your comments as (generally) I have contributed positively to this forum and (specifically) I did point to how your question needed improvement to be acceptable. Taking comments personally and making personal comments are both best avoided. But specifically with this dataset no method will tell you reliably how far your change is part of a seasonal cycle and how far part of a trend. However, subject-matter knowledge may help there. – Nick Cox Mar 13 '15 at 12:07
  • @NickCox: thanks. I appreciate your new answer and also that you didn't take my comment personally. Now I'm reading about Unit Root.. really not easy but interesting. – Revious Mar 13 '15 at 13:40
  • I've read the answer on Unit Root on StackExchange. It's really hard for me in any case. But maybe this document is even more clear: http://www.econ.ku.dk/metrics/Econometrics2_05_II/Slides/08_unitroottests_2pp.pdf – Revious Mar 13 '15 at 17:26
  • 1
    @Glen_b for what it's worth I actually couldn't find a good comprehensive forecasting question here. However the question looked before, it looks good now and I'm tempted to use this as an excuse to dust off my forecasting knowledge to try and write a complete answer – shadowtalker Mar 14 '15 at 12:05
  • The one thing I don't get is what a "linear logarithmic regression" is in this case. Log(traffic) vs time? – shadowtalker Mar 14 '15 at 12:06
  • Well, there's [this](http://stats.stackexchange.com/q/140163/36229) – shadowtalker Mar 14 '15 at 12:09
  • @ssdecontrol I think a reasonable answer here would be useful. – Glen_b Mar 14 '15 at 13:26

1 Answer

Statistics can be loosely defined as the practice of converting data into information. The original data may contain "unusual", i.e. non-typical, values that often obscure the routine identification of a useful model. The idea here is to separate the data into signal and error, i.e. deviations from the signal. These deviations can often be divided into typical deviations and exceptional deviations. The exceptional deviations can reflect one-time anomalies, seasonal anomalies, and/or level shifts or time trends, which amount to a set of consistent anomalies. Intervention detection schemes suggested by many including http://www.unc.edu/~jbhill/tsay.pdf and http://www.autobox.com/cms/index.php/blog/entry/build-or-make-your-own-arima-forecasting-model enable the identification of these "exceptional deviations". However, if we simply review the graph of the 70 daily values, we can immediately suggest that 4 values (days 26, 33, 34 and 62) may be candidates for adjustment prior to model identification:

The plot of the original data is: enter image description here

day  actual  adjusted
26   48      19
33   87      36
34   50      36
62   56      38
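
To give a feel for how such a step might be automated, here is a minimal Python sketch. It is not the intervention detection procedure referenced above: it is a much simpler rule that flags a value when it sits far from the median of its neighbours, and replaces it with that median. The function name `flag_and_adjust`, the window of 7 days, the threshold of 3.5 and the `demo` counts are all assumptions made for illustration; they are not the real traffic data.

```python
from statistics import median


def flag_and_adjust(series, window=7, threshold=3.5):
    """Flag points far from a local median and replace them with that median.

    Illustrative only: a simple robust rule, not the intervention
    detection scheme used in the answer.
    """
    half = window // 2
    adjusted = list(series)
    flagged = []
    for i, value in enumerate(series):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        # Neighbours are taken from the original series, excluding the point itself.
        neighbours = [series[j] for j in range(lo, hi) if j != i]
        centre = median(neighbours)
        # Median absolute deviation, scaled to be comparable to a
        # standard deviation when the data are roughly normal.
        mad = 1.4826 * median(abs(x - centre) for x in neighbours)
        scale = max(mad, 1.0)  # floor guards against a zero spread
        if abs(value - centre) > threshold * scale:
            flagged.append(i)
            adjusted[i] = centre
    return flagged, adjusted


# Made-up daily counts with one obvious spike at index 5.
demo = [18, 20, 19, 21, 20, 87, 19, 22, 20, 21, 19, 20]
flagged, adjusted = flag_and_adjust(demo)
print(flagged)       # indices judged exceptional
print(adjusted[5])   # the spike replaced by its local median
```

A rule like this only handles one-time pulses; level shifts and seasonal pulses need the fuller treatment described in the references.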

This step can of course be automated, but for the purposes of education I will simply walk through a step-by-step approach here. A plot of the outlier-adjusted data is: enter image description here. If we now compute the ACF of this adjusted data we obtain enter image description here, suggesting that a simple ARIMA(1,0,0)(0,0,0) model might be appropriate (N.B. this was further suggested by the PACF showing a spike at lag 1). If we proceed to estimate this ARIMA model in a robust manner we obtain enter image description here: an AR(1) with 4 pulses for these 70 values.
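
For readers who want to see what these two steps compute, here is a minimal sketch in plain Python. It assumes ordinary least squares with no pulse dummies, so it is not the robust estimation used above, and the helper names `acf` and `fit_ar1` as well as the noise-free `toy` series are hypothetical, chosen so the result is easy to verify.

```python
def acf(series, max_lag):
    """Sample autocorrelations r_0..r_max_lag of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
        for k in range(max_lag + 1)
    ]


def fit_ar1(series):
    """Ordinary least-squares fit of y[t] = c + phi * y[t-1] + e[t]."""
    x, y = series[:-1], series[1:]
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    phi = sxy / sxx
    return y_bar - phi * x_bar, phi


# Noise-free toy series obeying y[t] = 10 + 0.5 * y[t-1],
# so the fit should recover c = 10 and phi = 0.5.
toy = [40.0]
for _ in range(19):
    toy.append(10 + 0.5 * toy[-1])

c, phi = fit_ar1(toy)
print(round(c, 6), round(phi, 6))
print([round(r, 3) for r in acf(toy, 3)])
```

On real data the ACF of an AR(1) process decays roughly geometrically, while the PACF shows a single spike at lag 1, which is the pattern referred to above.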

We now inspect the ACF of the residuals from this model, looking for evidence of either a sufficient model or a model needing further tweaking/augmentation: enter image description here. This suggests the need either to add a seasonal AR component (stochastic), culminating in (1,0,0)(1,0,0)7, or to add seasonal/daily dummies to reflect a deterministic component. Deciding which remedy is best by manual (i.e. non-automatic) diagnostic checking simply requires trying both. If we use the seasonal AR augmentation we obtain the following model enter image description here with actual/fit/forecast enter image description here. If we instead choose a model that contains a set of daily dummies enter image description here to characterize the seasonal/daily effect, we obtain the following model enter image description here with actual/fit/forecast enter image description here. The statistics suggest that either model may be adequate, so both should be considered. In closing, the totally automatic solution (not shown here) preferred the second approach. Model identification using a single statistic like the AIC/BIC simply doesn't do justice to the intelligent design of model form, but that is just my opinion. Recall that all models are wrong, but these two final models seem useful. Note also the fairly broad confidence limits, suggesting possibly a different level.
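
The residual check described above can be sketched as follows. This is only an illustration under stated assumptions: the `residuals` list is made up to contain a weekly pattern, the helper names `autocorr` and `weekday_means` are hypothetical, and the band of 1.96/sqrt(n) is the usual rough white-noise approximation rather than an exact test.

```python
import math


def autocorr(series, lag):
    """Sample autocorrelation of a list of numbers at a single lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / c0


def weekday_means(series, period=7):
    """Average by position in the week.

    In a model containing only day-of-week dummies, the fitted values
    are exactly these group means.
    """
    buckets = [[] for _ in range(period)]
    for t, value in enumerate(series):
        buckets[t % period].append(value)
    return [sum(b) / len(b) for b in buckets]


# Made-up residuals: 10 weeks with a high first day and a low last day.
residuals = [3, 0, 0, 0, 0, 0, -3] * 10

r7 = autocorr(residuals, 7)
band = 1.96 / math.sqrt(len(residuals))  # rough 95% white-noise band
needs_seasonal_term = abs(r7) > band

print(round(r7, 3), round(band, 3), needs_seasonal_term)
print(weekday_means(residuals))
```

A lag-7 autocorrelation outside the band is the cue to try either remedy: a seasonal AR term treats the weekly pattern as stochastic, while the day-of-week means correspond to the deterministic dummy approach.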

RE: Nick's comment

@NickCox The AR(1) term and the constant generate a long-term asymptotic forecast coming off recent values to a constant. Note that the mean of the last 38 values is 30.1, reflecting a significantly different (higher) mean from that of the first 32 but not suggesting continued growth. In the case of the AR(1) model with the three daily dummies, the sum of the next 30 forecasts is 861, for an average of about 28.7, which is sympathetic to and not significantly different from the 30.1. I definitely think that this is not the only model that could be used for this data.
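
The reason the forecast settles to a constant can be written out for a plain AR(1) with a constant, ignoring the daily dummies and pulses for simplicity. The symbols c, \phi, \mu and e_t are notation introduced here for illustration; they do not come from the fitted output shown above.

```latex
y_t = c + \phi\, y_{t-1} + e_t, \qquad |\phi| < 1, \qquad \mu = \frac{c}{1 - \phi}

\hat{y}_{T+h \mid T} = \mu + \phi^{h}\,(y_T - \mu) \;\longrightarrow\; \mu \quad \text{as } h \to \infty

\frac{861}{30} = 28.7
```

The gap between the last observed value and the long-run mean shrinks by a factor of \phi at each step, so forecasts that start above \mu drift down towards it. That is consistent with a 30-day forecast average of 28.7 sitting slightly below the recent mean of 30.1.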

IrishStat
  • Really thanks, a beautiful answer. I will try to understand it well – Revious Mar 15 '15 at 16:32
  • 1
    I could have gone on for days and days writing about regression and time series. My solution incorporated the best of both ...I would be glad to help you clear up any sub-questions that you have. If you wish you can pose another question. – IrishStat Mar 15 '15 at 20:29
  • Interesting analysis in your inimitable style. I am struggling, however, to understand why your model fits imply a tendency for the series to go down when the visual evidence is mostly the other way, even if you ignore or discount moderate outliers. Also, how far does your ARIMA modelling match the fact that we are here dealing with a counted response (which starts at zero)? – Nick Cox Mar 16 '15 at 09:54