
I'm building a model to detect anomalies in multivariate data. In the first stage I run a VAR model every 10 minutes that predicts the next values, and when the actual values arrive I compute the Euclidean distance between prediction and observation. I keep a list of these distances for 30 days, and if a new distance is larger than the 90th percentile of that list, I flag it as an anomaly.

What are the problems with this method? I was told I will get an anomaly every 100 minutes because I run the model every 10 minutes, but I don't understand exactly why.
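For concreteness, the thresholding step described above might look like the sketch below. This is a hedged illustration, not the actual pipeline: the VAR forecast and resulting distances are replaced by stand-in random data, and the function name `is_anomaly` is made up for this example.

```python
import numpy as np

def is_anomaly(distance, history, q=0.90):
    """Flag `distance` as anomalous if it exceeds the q-quantile
    of the stored history of distances (the 30-day rolling list)."""
    threshold = np.quantile(history, q)
    return distance > threshold

# Stand-in for 30 days of 10-minute prediction-error distances
# (4320 = 30 days * 144 ten-minute intervals per day):
rng = np.random.default_rng(0)
history = rng.chisquare(df=3, size=4320)

new_distance = 9.0            # distance from the latest VAR forecast
flag = is_anomaly(new_distance, history)
```

Note that by construction roughly 10% of the historical distances themselves lie above the threshold, which is the source of the "one anomaly per 100 minutes" objection.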

kjetil b halvorsen
    10% of the values will be larger than the 90% quantile. 10% is one in every ten values. If you sample a value once every 10 minutes then, on average, you should detect an 'anomaly' once every 10x10 minutes. – Sextus Empiricus Nov 22 '20 at 23:11
  • I used this once for outlier detection and it worked effectively. [link](https://stats.stackexchange.com/questions/228719/box-plot-notches-vs-tukey-kramer-interval) – EngrStudent Nov 23 '20 at 01:02
  • *Calling* something an "anomaly" would be a little extreme but might do no harm. *Doing* something about those "anomalies," such as adjusting them downward, downweighting them, or discarding them altogether, would be a terrible idea, though, unless you are certain that every 30 days *exactly* ten percent of all the data will be anomalously high--and such a circumstance is hardly plausible. In the alternative, you will be selectively discarding high values, which will bias everything you do with the data. – whuber Nov 23 '20 at 17:03
  • Thank you very much guys:) – Arkady Mankovsky Nov 27 '20 at 07:24
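Sextus Empiricus's point in the comments can be checked with a quick simulation (a sketch; the gamma distribution and sample sizes are illustrative assumptions, not the questioner's data):

```python
import numpy as np

rng = np.random.default_rng(42)

# 30 days of 10-minute distances drawn from a stationary distribution
distances = rng.gamma(shape=2.0, scale=1.0, size=4320)
threshold = np.quantile(distances, 0.90)

# New distances from the *same* distribution, i.e. no real anomalies:
# about 10% still exceed the threshold, so with one check every
# 10 minutes you expect one "anomaly" per ~100 minutes on average.
new = rng.gamma(shape=2.0, scale=1.0, size=100_000)
rate = (new > threshold).mean()
```

This is why a fixed-quantile cutoff measures the tail of the normal error distribution rather than detecting genuinely unusual behavior, echoing whuber's warning about acting on such flags.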

0 Answers