What are some robust and scalable approaches to anomaly detection in time series data? I am mainly looking for practical approaches implemented in Python, R, Java, etc. I am also looking for pointers to research papers or theses that could help in working out a solution to this problem.

I am currently studying Pavlidou's thesis, "Time series analysis of remotely-sensed TIR emission preceding strong earthquakes", and exploring the R packages xts, zoo, and hts.

rahulkmishra
  • I don't know what kind of data you have or what anomaly means in your context, but since you used the tags `time series` and `outliers`, you may be interested in the paper [Chen and Liu (1993)](http://doi.org/10.1080/01621459.1993.10594321), _Joint Estimation of Model Parameters and Outlier Effects in Time Series_. You may find related discussion in [this](http://stats.stackexchange.com/questions/104882/) and [this](http://stats.stackexchange.com/questions/116363/) post. – javlacalle Oct 29 '14 at 08:34
  • The data come from the power domain: half-hourly power consumption is provided for each consumer. Here an anomaly means a deviation from what the consumer has generally been consuming. For example, if a particular consumer generally consumes more on weekend afternoons than on weekdays, but we find lower consumption (possibly due to external factors, such as being out of station), then it could be labelled an anomaly, or deviant from normal behavior. – rahulkmishra Oct 29 '14 at 08:49
  • Then there could be other scenarios, such as a consumer deviating from other consumers who were similar in the past. If the deviance is large, that consumer's usage could be a potential anomaly. – rahulkmishra Oct 29 '14 at 08:49
  • Have a look at the R functions (mostly with C++ backends) in the [robfilter package](http://cran.r-project.org/web/packages/robfilter/index.html) – user603 Oct 29 '14 at 09:01
  • Then your data are panel data rather than time series: you have observations for several consumers at different time points. The references I mentioned may not be straightforward to apply to your context. – javlacalle Oct 29 '14 at 09:02
  • Sincere apologies, I do not understand what panel data is. – rahulkmishra Oct 29 '14 at 09:12
  • See for example this [introduction](http://en.wikipedia.org/wiki/Panel_data) and the links and references given there. Contrary to time series, where a single individual is observed over several periods, panel data contain information for several individuals that are observed over several periods. – javlacalle Oct 29 '14 at 09:27
  • Thanks for the link. Yes, it's panel data. Can it not be treated as 'n' separate time series for 'n' consumers, as in the first kind of problem listed in my first comment? – rahulkmishra Oct 29 '14 at 09:36
  • If you want to study each consumer separately, then you can apply time series methods and the approach I mentioned for each series. However, detecting periods of higher or lower consumption may not necessarily mean detecting anomalies or outliers. You may capture possible differences between weekdays and weekends by means of calendar or other dummy regressor variables. – javlacalle Oct 29 '14 at 10:27
  • Thanks for the prompt reply! Weekday versus weekend was a trivial example I gave. In reality, we would need to look at seasonality and trends over hours, day of week, day of month, etc., as is done for typical timestamped time series data. Does that make sense? Something like ARIMA and its family of models could help, I think. – rahulkmishra Oct 29 '14 at 10:41
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/18247/discussion-between-rahulkmishra-and-javlacalle). – rahulkmishra Oct 29 '14 at 11:23
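
The per-consumer idea discussed in the comments (compare each reading with what that consumer usually uses at the same time of week) could be sketched roughly as follows. This is only an illustration, not a method proposed by anyone in the thread: the function names, the (day of week, half-hour slot) grouping, the median/MAD baseline, and the 3.5 cutoff are all assumptions chosen for the example, and the demo data are synthetic.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def seasonal_slot(ts):
    """Key a timestamp by (day of week, half-hour slot of the day)."""
    return ts.weekday(), ts.hour * 2 + ts.minute // 30


def flag_anomalies(readings, threshold=3.5):
    """Flag readings far from one consumer's typical value for that slot.

    readings: list of (datetime, consumption) pairs for ONE consumer.
    Returns the pairs whose robust z-score exceeds `threshold` in magnitude.
    """
    by_slot = defaultdict(list)
    for ts, value in readings:
        by_slot[seasonal_slot(ts)].append(value)

    # Robust baseline per slot: median and median absolute deviation (MAD).
    baseline = {}
    for slot, values in by_slot.items():
        med = median(values)
        mad = median(abs(v - med) for v in values)
        baseline[slot] = (med, mad)

    flagged = []
    for ts, value in readings:
        med, mad = baseline[seasonal_slot(ts)]
        if mad == 0:
            # No spread in this slot: skipped here to avoid dividing by zero,
            # which means such slots are never flagged by this sketch.
            continue
        score = 0.6745 * (value - med) / mad  # conventional modified z-score
        if abs(score) > threshold:
            flagged.append((ts, value))
    return flagged


# Hypothetical demo data: 8 weeks of half-hourly readings for one consumer,
# with higher usage on weekend afternoons and a small week-to-week variation.
start = datetime(2014, 9, 1)  # a Monday
readings = []
for i in range(8 * 7 * 48):
    ts = start + timedelta(minutes=30 * i)
    week = i // (7 * 48)
    base = 2.0 if ts.weekday() >= 5 and 12 <= ts.hour < 18 else 1.0
    readings.append((ts, base + 0.05 * (week % 3)))

# Inject one unusually low Saturday-afternoon reading.
odd = datetime(2014, 10, 11, 14, 0)
readings = [(ts, 0.2 if ts == odd else v) for ts, v in readings]

flagged = flag_anomalies(readings)
print(flagged)
```

On this synthetic series, only the injected Saturday reading is flagged. Real consumption data would likely also need trend handling, holidays, and a longer history per slot, which this sketch ignores.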

1 Answer

Below is a link to a paper that discusses several techniques for identifying anomalies in time series data on disease.

In short, there are several types of anomaly detection methods for time series. You first need to figure out the distribution of your dataset in order to identify which technique is most appropriate for it.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3510767/
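
As a rough illustration of checking the distribution first, one could look at how skewed the values are before choosing a thresholding rule. The sketch below is not taken from the linked paper: the helper names, the skewness cutoff of 1.0, and the two rule labels are assumptions made for the example.

```python
from statistics import mean, pstdev


def sample_skewness(values):
    """Moment-based skewness: about 0 for symmetric data, > 0 for a long right tail."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return 0.0  # constant series: no asymmetry to measure
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def suggest_rule(values, skew_limit=1.0):
    """Hypothetical helper: pick a thresholding rule from the skewness.

    The cutoff is a rule of thumb assumed for this example, not a standard.
    """
    if abs(sample_skewness(values)) <= skew_limit:
        return "mean/sd"      # roughly symmetric: mean and standard deviation
    return "median/MAD"       # heavily skewed: prefer robust statistics


print(suggest_rule([1, 2, 3, 4, 5]))
print(suggest_rule([1, 1, 1, 1, 10]))
```

A fuller check would also look at histograms, autocorrelation, and seasonality, since skewness alone says little about time dependence.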

show_stopper
  • I am happy to believe that the paper contains relevant material, but this doesn't qualify as a very good answer. It would be **much better** to summarize the content in terms of naming and ideally explaining various techniques. Then the link would have a useful purpose in providing back-up. Occasionally answers can just say "Read this and it answers your question" but more often it's better to aim at answers that are as self-contained as possible. – Nick Cox Oct 29 '14 at 12:13