Detecting outliers in time-series if I don't have a "normal" dataset

Question

I have been trying to detect anomalies in a time-series dataset. The dataset is multivariate, and two columns are of special interest: one gives the maximum power that should have been generated, and the other gives the actual power generated. I want to label each data point as "fault" or "fault-free" based on the difference between the theoretical maximum and the actual power generated. Plotting this difference shows that its distribution is skewed (see picture).

Most of the methods I have read about, including PCA, Mahalanobis distance, and neural networks, require a normal (i.e. fault-free) training dataset, which I cannot provide with certainty. I have also tried control charts, but they did not work.

Are there any methods that you can recommend?

Thank you

See: https://stats.stackexchange.com/questions/129274/outlier-detection-on-skewed-distributions – Jenks, Jun 26 '19 at 14:50
I was looking for more of a density estimation approach (k-means didn't work, but something similar to that). – eemamedo, Jun 26 '19 at 15:34
If you're looking at just a single value, there aren't many options: the density of a single variable is one-dimensional, so you are really just looking at interquartile ranges and things of that sort. You could also mirror the distribution into the negative range and work out what the hypothetical SD would be if it were normally distributed. – Jenks, Jun 26 '19 at 15:36
What I am trying to achieve is outlier detection based on the difference between the theoretical maximum and the actual power generated. The majority of samples fall within a difference of about 0-80 kW, so anomalies would be anything with a difference of more than 80 kW. I got it done manually (using pretty much an "if" statement), but I was looking for a density/clustering approach. – eemamedo, Jun 26 '19 at 15:39
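
The two suggestions in the comments above (an interquartile-range rule, and mirroring the distribution about zero to get a hypothetical normal SD) can be sketched as follows. This is only a rough illustration, assuming the differences are non-negative and pile up near zero. The function names, the 1.5 and 3.0 multipliers, and the optional median-based scale (used so that the long tail does not inflate the estimate) are illustrative choices, not something taken from the thread.

```python
import numpy as np

def iqr_upper_fence(diff, k=1.5):
    """Upper Tukey fence: Q3 + k * IQR (k=1.5 is a convention, not a rule)."""
    x = np.asarray(diff, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    return q3 + k * (q3 - q1)

def mirrored_sigma(diff, robust=True):
    """SD of the sample after reflecting it about zero.

    The mirrored sample [x, -x] has mean 0, so its SD is sqrt(mean(x**2)).
    With robust=True the SD is instead inferred from the median of |x|,
    using the fact that median(|Z|) is about 0.6745 * sigma for Z ~ N(0, sigma).
    """
    x = np.asarray(diff, dtype=float)
    x = x[~np.isnan(x)]
    if robust:
        return np.median(np.abs(x)) / 0.6745
    return np.sqrt(np.mean(x ** 2))

def label_faults(diff, n_sigma=3.0):
    """Flag points whose difference exceeds n_sigma mirrored SDs."""
    x = np.asarray(diff, dtype=float)
    return x > n_sigma * mirrored_sigma(x)

# Toy values only, not data from the question
toy = [1, 2, 3, 4, 5, 100]
print(iqr_upper_fence(toy))
print(label_faults(toy))
```

Both rules only put a data-driven number in place of the hand-picked 80 kW cut-off; they ignore the time ordering of the observations.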

IrishStat · Answer 1 · 2019-06-26T21:42:37.437

0

https://www.jstor.org/stable/2673610?seq=1#page_scan_tab_contents deals with outlier detection in a multi-endogenous setting. http://docplayer.net/12080848-Outliers-level-shifts-and-variance-changes-in-time-series.html deals with the issue in a single-endogenous setting. It is important to note that any ARIMA process present needs to be identified SIMULTANEOUSLY with the outliers.

Software that assumes no outliers and tries to identify ARIMA structure is as flawed as software that requires pre-specification of the ARIMA structure before identifying outliers.

If you feel comfortable with the ARIMA specification, you might use an R program; if NOT, then look at http://www.autobox.com

edited Jun 26 '19 at 21:42

answered Jun 26 '19 at 15:04

To clarify, I am not trying to identify outliers in a multivariate setting but based on the difference between the maximum and actual power produced. The multivariate settings (features) will contribute equally to both normal and anomalous data points. – eemamedo Jun 26 '19 at 15:42
Then difference the two series and use the second reference. If you wish, you can post the difference and I will try to help further. – IrishStat Jun 26 '19 at 16:17
That difference is plotted in the main post. Essentially, in that dataset, most of the points correspond to a difference of up to 100 (meaning that the difference between actual and maximum power is within 100 kW). What I am trying to do is find an approach that would label the largest cluster (or bin) as normal operating points and the rest as non-normal. – eemamedo Jun 26 '19 at 17:00
Follow the second reference to detect anomalies. If you wish to actually LIST the values, I will try to help further. – IrishStat Jun 26 '19 at 19:08
Has your question been answered? If so, accept an answer and close the question. – IrishStat Jun 29 '19 at 17:36
It was not answered. – eemamedo Jun 29 '19 at 17:41
"Are there any methods that you can recommend?" I recommend intervention detection to enable you to deal with non-normal data, as it provides a clue to the underlying ARIMA mechanism while dealing with anomalies or multiple means. – IrishStat Jun 29 '19 at 19:59

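For the goal stated in the comments (label the largest cluster as normal operating points and the rest as non-normal), one density-based option is a two-component Gaussian mixture fitted to the difference, with the heavier component treated as "normal". This is not the intervention-detection method recommended in the answer and it ignores any ARIMA structure; it is a minimal NumPy sketch under the assumption that faults show up as unusually large differences. All names, the starting values, and the 0.5 responsibility cut-off are illustrative.

```python
import numpy as np

def _weighted_pdf(x, w, mu, sd):
    """Weighted Gaussian densities, shape (n_points, 2)."""
    z = (x[:, None] - mu) / sd
    return w * np.exp(-0.5 * z ** 2) / (sd * np.sqrt(2.0 * np.pi))

def fit_two_gaussians(x, n_iter=200, tol=1e-8):
    """Fit a 1-D two-component Gaussian mixture with plain EM."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad_scale = 1.4826 * np.median(np.abs(x - med)) + 1e-9
    # Assumed starting point: a tight bulk component and a wide tail component
    w = np.array([0.9, 0.1])
    mu = np.array([med, x.max()])
    sd = np.array([mad_scale, x.std() + 1e-9])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: probability that each point belongs to each component
        dens = _weighted_pdf(x, w, mu, sd)
        total = dens.sum(axis=1, keepdims=True) + 1e-300
        resp = dens / total
        # M-step: re-estimate weights, means and standard deviations
        nk = resp.sum(axis=0) + 1e-12
        w = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        ll = float(np.log(total).sum())
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return w, mu, sd

def label_by_mixture(diff):
    """True where a point is unlikely to belong to the heaviest component
    and lies above that component's mean (i.e. a large shortfall)."""
    x = np.asarray(diff, dtype=float)
    w, mu, sd = fit_two_gaussians(x)
    normal = int(np.argmax(w))
    dens = _weighted_pdf(x, w, mu, sd)
    resp_normal = dens[:, normal] / (dens.sum(axis=1) + 1e-300)
    return (resp_normal < 0.5) & (x > mu[normal])

# Synthetic illustration only, not the data from the question
rng = np.random.default_rng(0)
demo = np.concatenate([rng.normal(40, 15, 950), rng.normal(300, 40, 50)])
print(label_by_mixture(demo).sum(), "points flagged")
```

Because the real difference is skewed, a Gaussian bulk component may fit poorly; transforming the difference first (for example with a log) or using a different component family are possible variations.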