3

I have time-series data, say a pandas Series, indexed by time (sampled hourly) and holding temperature measurements. I am looking for a statistical or time-series principle that can tell me whether a time series is well-behaved or not.

By a well-behaved time series I mean that the distribution of temperature within a day is the same, or almost identical, for all 7 or even 30 days of the month. The reason for detecting even a slight deviation is to know whether the sensors that collect the temperature are working properly. The device whose temperature the sensors measure every hour has the property that its temperature distribution over a whole day remains almost identical throughout the month.

Sean Easter
lovekesh
  • The distribution remains the same for the whole month? Is this measuring the temperature of a fridge or something like that? – adam Jul 22 '15 at 16:06
  • Well, it's just a hypothetical scenario. I can't really talk about why I need this, but I think the description specifies my problem very closely, so I need some help. – lovekesh Jul 22 '15 at 18:24
  • Specification of the problem often requires data. If you can't post your data, then transform your data and post the transformed data. This might help to draw out precisely what you need to do or can't do. – IrishStat Jul 23 '15 at 21:09
  • @IrishStat I will post some dummy data very soon. Thanks again. – lovekesh Jul 24 '15 at 05:55
  • This question is so vague and general that no one will be able to tell whether any of the (extremely different) proposed answers is any good for your situation. All those that have appeared so far implicitly adopt relatively strong (but differing) assumptions about your data and about the kind of "abnormality" you are looking for. If you could be more specific about those two things you would likely get better guidance. – whuber Jul 25 '15 at 13:58

4 Answers

2

Maybe start simple. If you expect the distributions to be identical from day to day, test each day's distribution against the baseline (whatever you consider normal), for example with a two-sample Kolmogorov-Smirnov test: http://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/ks2samp.htm

If you are looking for intraday anomaly detection and you have a good model for the distribution, can you just use a probability cut-off for outliers?
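The probability cut-off idea could be sketched roughly as follows. This is only an illustration: it assumes readings for a given hour are approximately normal, and the function name `flag_outliers`, the cut-off of 0.001 and the sample values are all hypothetical, not taken from the question.

```python
from statistics import NormalDist


def flag_outliers(baseline, new_readings, cutoff=0.001):
    """Return (index, value, tail probability) for readings below the cut-off.

    Assumes the trusted baseline readings are roughly normal (illustrative only).
    """
    # Fit mean and standard deviation to the trusted data
    model = NormalDist.from_samples(baseline)
    flagged = []
    for i, x in enumerate(new_readings):
        c = model.cdf(x)
        tail = 2 * min(c, 1 - c)  # two-sided tail probability under the model
        if tail < cutoff:
            flagged.append((i, x, tail))
    return flagged


# Hypothetical trusted readings for one hour of the day
baseline = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.4, 20.0]
new_readings = [20.2, 19.9, 23.5, 20.1]

flagged = flag_outliers(baseline, new_readings)
for i, x, p in flagged:
    print(f"reading {i}: {x} (two-sided tail probability {p:.2e})")
```

If the normal assumption does not hold for your device, the same structure works with any fitted model that gives you a CDF.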

gbasin
  • What you are suggesting is an outdated out-of-model test, which is flawed by the effect of unincorporated outliers, level shifts, seasonal pulses and local time trends, whereas Intervention Detection is a "probability cut-off" based upon a within-model test that can lead to the empirical identification of not only pulses but also level shifts, seasonal pulses and local time trends. – IrishStat Jul 23 '15 at 17:15
  • Agreed, but the way the problem was presented makes it sound like seasonality and level shifts are not an issue :) – gbasin Jul 23 '15 at 17:19
  • @gbasin Yes, you are right. Seasonality and level shifts are not an issue. – lovekesh Jul 24 '15 at 05:54
1

Detecting the onset of unusual activity is the subject of outlier detection, and of nearly every answer that I have recently made. A model reflecting period-to-period and/or day-to-day dependency can be developed using Transfer Function/Dynamic Regression, while "unusual" innovations can be detected when typical rules fail. If you wish to post your data I would be happy to take a look at it, and hopefully other readers would do the same. The following is a very good thread on anomaly (intervention) detection: "Detecting Outliers in Time Series (LS/AO/TC) using tsoutliers package in R. How to represent outliers in equation format?". Read all the answers and comments, and follow the Tsay 1986 article particularly closely: http://www.unc.edu/~jbhill/tsay.pdf

IrishStat
  • Question is related to outlier detection in time series, your link does not provide any information on this – adam Jul 22 '15 at 16:04
  • @adam Which link are you referring to? All my links refer to time series, as that is the only subject/topic I know. – IrishStat Jul 22 '15 at 16:09
  • Both links go to the same answer of yours, which is quite general and does not clearly address outlier detection at all. Since you have posted often, and sometimes in detail, about outlier detection in time series, I'm confident you could find a better reference than that! – whuber Jul 22 '15 at 16:19
  • @whuber Sorry about that ... I have expanded my answer. – IrishStat Jul 22 '15 at 18:36
  • Have you searched the literature on robust control charts? You could start [here](https://feb.kuleuven.be/public/ndbae06/PDF-FILES/Robust_Monitoring.pdf) – user603 Jul 22 '15 at 21:25
  • This is a non-starter: "This article presents a control chart for time series data, based on the one-step-ahead forecast errors of the Holt-Winters forecasting method." It is a forecasting method that presumes a specific model. – IrishStat Jul 22 '15 at 21:41
  • @IrishStat Thanks for the links. R package looks great. I will go through the module. – lovekesh Jul 23 '15 at 05:09
  • Unfortunately the free R package has some serious shortcomings. It assumes that the dominant structure is memory by identifying the ARIMA portion first, whereas the deterministic portion may be dominant. Secondly, and perhaps more importantly, it uses a naive one-step list-based procedure (AIC/BIC) to identify the ARIMA portion from a fixed set of models, whereas ARIMA modelling is an iterative identification process. – IrishStat Jul 23 '15 at 10:47
  • @Irishstat, I figured out a way by which we can check deterministic features first and then use tsoutliers package and it works just fine. – forecaster Jul 23 '15 at 21:56
  • @forecaster That may be true, but I would be further interested in performing/reporting a head-to-head comparison in terms of variance reduction between the two approaches. Did you consider incorporating deterministic trend changes involving 1,2,3,...-type series, or was this not part of the study? – IrishStat Jul 23 '15 at 22:17
  • @IrishStat I'm not sure I follow "trend changes 1,2,3,...". Is it a local trend? – forecaster Jul 24 '15 at 01:20
  • http://stats.stackexchange.com/questions/161571/determining-order-of-arima-model-using-box-jenkins-correct-approach-argumenta/162328#162328 discusses two types of trends – IrishStat Jul 24 '15 at 01:45
  • @IrishStat, I disagree with the proposed model in the link, I would pick random walk with drift as it is more parsimonious than the proposed model from Autobox. – forecaster Jul 25 '15 at 03:14
  • Parsimony is an objective BUT model sufficiency is a higher objective. Take your model compute the errors and then take those errors and examine them for sufficiency with Intervention detection or simply plot the errors. – IrishStat Jul 25 '15 at 07:15
  • @forecaster On second thought, why don't you pose a question regarding just one of the two series under discussion? Present the data and your model and detail how it was identified using a list-based procedure. Present the normal statistics for the model, including tests of parameter significance and a complete analysis of the errors validating your "white noise" conclusion. I will respond using methods and procedures which you perhaps don't have access to and compare/contrast your model and the one I suggested. This might be educational and I think it will be of interest to the list. – IrishStat Jul 25 '15 at 12:13
1

I think the best method for identifying sensor problems from time series data is to test for stationarity rather than for outliers or anomalies alone. Outliers are individual data points that lie outside the expected or normal range. Anomalies are patterns of data points that are somehow distinct or "not typical", even though they might be inside the normal or expected range.

In contrast, a non-stationary time series is one whose generating distribution has changed or is changing over time. In other words, stationarity is concerned with the generating distribution and not with individual data points or groups of data points. As you said, the distributions associated with working sensors stay the same (i.e. are "stationary") over a month.

Here are a few introductory references:

The problem with outlier detection as a method is that there might be many causes of outliers that are not related to faults in sensors. The same goes for anomalies. Some changes in stationarity might also be accompanied by outliers or anomalies, but that is not necessarily the case. In contrast, changes in stationarity will almost always be related to faults or failures in sensors and in the related processes of data capture and transmission.

The downside of stationarity tests is that it is hard to detect changes in stationarity quickly, in real time, with high reliability (i.e. with a minimum of false positives). You might combine several methods to get "early warning signals" of possible sensor problems, and then confirm them later (hours or days later) after more data comes in.
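One way to picture the "early warning, then confirm" idea is a two-stage alert on daily summaries. The sketch below is not a stationarity test in the formal sense. It uses a simple mean-shift check as a stand-in signal, and the function name `monitor`, the thresholds and the data are hypothetical choices for illustration.

```python
from statistics import mean, stdev


def monitor(daily_means, n_baseline=7, z_warn=3.0, confirm_days=3):
    """Label each post-baseline day as 'ok', 'warning', or 'confirmed'.

    The first n_baseline days are assumed trustworthy (an assumption of
    this sketch). A day raises a warning when its mean deviates from the
    baseline by more than z_warn baseline standard deviations; the problem
    is confirmed once that persists for confirm_days consecutive days.
    """
    base = daily_means[:n_baseline]
    mu, sigma = mean(base), stdev(base)
    labels, run = [], 0
    for m in daily_means[n_baseline:]:
        run = run + 1 if abs(m - mu) / sigma > z_warn else 0
        if run == 0:
            labels.append("ok")
        elif run < confirm_days:
            labels.append("warning")
        else:
            labels.append("confirmed")
    return labels


# Hypothetical daily mean temperatures: 7 trusted days, then a shift on day 10
daily_means = [20.0, 20.1, 19.9, 20.0, 20.2, 19.8, 20.0,
               20.1, 19.9, 21.5, 21.6, 21.7, 21.8]
labels = monitor(daily_means)
print(labels)
```

Requiring several consecutive warnings trades detection speed for fewer false positives, which is the tension described above.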

MrMeritology
  • ARIMA models are non-stationary and easily extend to Intervention Detection – IrishStat Jul 24 '15 at 22:53
  • @IrishStat Yes, ARIMA models can be used to detect non-stationarity. I presume it has strengths and limitations, but I don't have experience with it so I won't comment further. – MrMeritology Jul 24 '15 at 23:08
1
  1. Assume that for the last K days you have measurements which you can trust and which are OK.
  2. You now want to see whether the distribution on day K+1 is the same as in the previous K days. To check that, you can run a two-sample Kolmogorov-Smirnov test.

Example (R):

library(data.table)

set.seed(34976742)

# daily pattern
DT <- data.table(h=1:24, base = rlogis(24, 20, 2))
# number of days in history
K <- 20

# simulated historical data
historical.DT <- DT[, list(day = 1:K, t = rnorm(K, base, .5)), by = h]
# simulated test day data
new.DT <- historical.DT[, list(day = K+1, t = rnorm(1, mean(t), 1)), by = h]

# Two-sample Kolmogorov-Smirnov test
ks.test(historical.DT[, t], new.DT[,t])

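Note that historical.DT[, t] is a vector of measurements ordered first by hour, then by day, while new.DT[, t] is ordered by hour. The two-sample test ignores this ordering, since it compares only the empirical distributions of the pooled values.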
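Since the question is posed in terms of a pandas Series, here is a rough Python counterpart of the same check, using `scipy.stats.ks_2samp`. The simulated data (dates, sinusoidal daily pattern, noise level, and a 4-degree shift on the last day to mimic a faulty sensor) are assumptions made up for this sketch, not the asker's data.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical data: 21 days of hourly readings sharing one daily pattern
index = pd.date_range("2015-07-01", periods=21 * 24, freq=pd.Timedelta(hours=1))
pattern = 20 + 3 * np.sin(2 * np.pi * index.hour.to_numpy() / 24)
temps = pd.Series(pattern + rng.normal(0, 0.5, len(index)), index=index)

# Boolean mask selecting the final calendar day (the day under test)
days = temps.index.normalize()
last_day = days == days[-1]

# Simulate a faulty sensor: shift the last day's readings up by 4 degrees
temps.loc[last_day] += 4.0

history = temps[~last_day]   # the K = 20 trusted days
test_day = temps[last_day]   # day K + 1

# Two-sample Kolmogorov-Smirnov test on the pooled values
result = ks_2samp(history.to_numpy(), test_day.to_numpy())
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")
```

A small p-value suggests the test day's distribution differs from the history. As in the R example, the hours are pooled, so this compares whole-day distributions rather than hour-by-hour values.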