2

I know my title seems to go against what an outlier is, but I don't know any other way of phrasing it.

Let's say I have a spreadsheet, and a column with the following values:

[From newest-to-oldest] 100, 9, 7, 100, 7, 7, 100, 9, 9, 100, 7, 9

If we were to disregard the 100's for a second, the average would seem to be 8; however, every third entry, the value seems to (reliably) jump up by 92+ the average. What I'm trying to figure out is how to predict what the next value will be given what we've seen before. So, if the last few values where:

[From newest-to-oldest] 9, 100, 7, 9, 100, 7, ect.

It should say "the next value should be 8 ± 1." If the values where instead:

[From newest-to-oldest] 9, 7, 100, 7, 7, 100, 9, ect.

It should say "the next value should be 100 ± 0." How would I go about doing this? This is all in excel (Google Spreadsheets to be more precise). I originally tried to find the outliers and throw the two sets of data into separate columns. Then, depending on the interval, choose the average from one of the two columns. However, since the 100's aren't really outliers (being that they're so common), using standard methods of finding them didn't work.

whuber
  • 281,159
  • 54
  • 637
  • 1,101
  • Do the 100s arrive at regular intervals? Or do they arrive irregularly? (In your example, every 3rd item is 100 and a remarkably simple model to forecast $y_t$ would be $\hat{y}_t = y_{t-2}$.) – Matthew Gunn Jan 26 '17 at 02:59
  • They arrive at regular intervals. Also, the upper value isn't always 100 (I just used 100 as an example. It could be 101, 99, 98, 102, but they hang steadily around their own upper mean). I know I could hardcode ŷ = yt−2 but I don't know what the interval is going to be; this is just an example column. I have dozens of columns and each have their own interval with their own upper and lower means, but the intervals are steady. I know I could go through each column and find it out by hand, but I want the spreadsheet to be dynamic. – David Perez Jan 26 '17 at 03:15
  • 1
    The first thing I would try are some basic [ARMA models](https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model). The basic AR(p) model can be estimated with linear regression of the form $y_t = a + b_1y_{t-1} + b_2 y_{t-2} + \ldots + b_p y_{t-p} + \epsilon_t$ where $p$ is the number of lags. This type of model can capture all kinds of cyclic behavior. – Matthew Gunn Jan 26 '17 at 03:19
  • If you identify a regular periodicity, I'd suggest looking at seasonal differences. FOr your above data, consider looking at $y_t-y_{t-3}$; If you have a specific set of data can you show a plot of the data in time order? – Glen_b Jan 26 '17 at 08:34
  • To me an outlier is whatever causes surprise in relation to the majority of the data, and in saying that I stand on the shoulders and sit at the feet of two excellent statisticians. For more about that definition, and much else, see http://stats.stackexchange.com/questions/78063/replacing-outliers-with-mean The ideas that outliers exist objectively and can be identified unproblematically cause more problems than they solve in my experience. – Nick Cox Jan 27 '17 at 11:56
  • Knowledge about the nature of the data may help here, for example it's very different whether you are sure that every third value belongs to another population (in which case you indeed can separate them without the need of any "outlier identification"), or whether they occur mostly as third values but sometimes elsewhere. Also how exactly your prediction is used and what kind of "loss" occurs if you get it wrong would play a role. I agree with others that time series models may help, but it is possible that the story behind the data here even suggests more regularity than such models imply. – Christian Hennig Aug 18 '21 at 09:51

1 Answers1

0

Thank you Matthew Gunn and Glen-b! Your responses led me to a YouTube video by Jalayer Academy titled "Time Series Forecasting with Google Sheets" which, to some extent, is a solution to my problem. It's not a complete solution, however, since it requires me to know what the interval will be, but I guess I can hardcode three-to-five of the most common intervals I've seen and have Excel brute force its way through the calculations, finding which of the five gives me the line of best fit.

It's not a perfect solution, but it's a solution and appreciate you guys for leading me there.

Nick Cox
  • 48,377
  • 8
  • 110
  • 156
  • I haven't looked in detail at what that guy does, but I'd certainly try a 3rd or 5th order autoregressive model. Something like: https://www.youtube.com/watch?v=bkb2vEa5Ku4 (I haven't watched this video either, but it looks like he goes over how to fit an AR(3) model in excel.) – Matthew Gunn Jan 27 '17 at 02:24
  • 1
    Fun fact: statisticians [Yule and Walker first developed an AR(2) model](http://www-stat.wharton.upenn.edu/~steele/Courses/956/ResourceDetails/YuleWalkerAndMore.htm) back in the 1920s to model the 11 year [sunspot cycle](https://en.wikipedia.org/wiki/Sunspot). – Matthew Gunn Jan 27 '17 at 02:26