
My dataset is time-series sensor data, and the anomaly ratio is between 5% and 6%.

1. For evaluating time-series anomaly detection, which is better: precision/recall/F1 or ROC-AUC?

When looking into this empirically, I found that some papers use precision/recall/F1 and others use ROC-AUC.

Considering that positive samples (anomalies) are far fewer than negative samples (normal points), which one is better?

I'm confused about this issue.
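
For concreteness, here is a minimal sketch of the two options with sklearn, on synthetic scores with a roughly 5% anomaly ratio (purely illustrative data, not my actual sensors): ROC-AUC is computed directly from the anomaly scores, while F1 first requires choosing a hard threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)

# Synthetic setup: ~5% anomalies, with anomalies tending to receive higher scores.
y_true = (rng.random(10_000) < 0.05).astype(int)
scores = rng.normal(loc=y_true * 2.0, scale=1.0)

# ROC-AUC is threshold-free: it is computed directly from the raw scores.
print("ROC-AUC:", roc_auc_score(y_true, scores))

# Precision/recall/F1 need a hard threshold on the scores first.
threshold = np.quantile(scores, 0.95)          # flag the top 5% as anomalies
y_pred = (scores >= threshold).astype(int)
print("F1 (positive class):", f1_score(y_true, y_pred))
```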

2. If I use precision/recall/F1, should I check precision/recall/F1 only for the positive class?

I think that because the positive samples are so sparse, it's not appropriate to check precision/recall/F1 only for the positive class.

So should I check precision/recall/F1 for both the positive and the negative class?

If that's right, can I report precision/recall/F1 with the macro average in my paper?

(You can see the picture below; I used classification_report from the sklearn library.)

[screenshot: sklearn classification_report output]
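
Since the screenshot is not reproduced here, this is a small sketch of the kind of report meant, assuming sklearn's classification_report on hypothetical labels and predictions; the "macro avg" row is simply the unweighted mean of the per-class scores.

```python
from sklearn.metrics import classification_report

# Hypothetical labels and predictions for an imbalanced case (1 = anomaly).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2   # 2 false positives, 2 false negatives

print(classification_report(y_true, y_pred, target_names=["normal", "anomaly"]))
# The output has one row per class plus "macro avg" (unweighted mean over classes)
# and "weighted avg" (weighted by each class's support).
```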

1 Answer


Do not use accuracy to evaluate a classifier; see:

  • Why is accuracy not the best measure for assessing classification models?
  • Is accuracy an improper scoring rule in a binary classification setting?
  • Classification probability threshold

The same problems apply to sensitivity, specificity, F1, and indeed to all evaluation metrics that rely on hard classifications.

Instead, use probabilistic classifications, and evaluate these using proper scoring rules. Note that AUC is a semi-proper scoring rule, so if there is an absolute choice between this and the improper rules above, use AUC. Better: use a real proper scoring rule. See the tag wiki for more information and pointers to literature.
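
A minimal sketch of the probabilistic route, assuming the detector can output predicted anomaly probabilities (synthetic, illustrative numbers only): the Brier score and log loss are strictly proper scoring rules (lower is better) evaluated on the probabilities themselves, whereas ROC-AUC only looks at their ranking.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic ground truth (~5% anomalies) and hypothetical predicted anomaly probabilities.
y_true = (rng.random(10_000) < 0.05).astype(int)
p_hat = np.clip(0.05 + 0.6 * y_true + rng.normal(0.0, 0.1, size=y_true.size), 1e-3, 1 - 1e-3)

# Strictly proper scoring rules: score the probabilities themselves (lower is better).
print("Brier score:", brier_score_loss(y_true, p_hat))   # mean squared error of the probabilities
print("Log loss:   ", log_loss(y_true, p_hat))

# Semi-proper: ROC-AUC only depends on how the probabilities rank the points.
print("ROC-AUC:    ", roc_auc_score(y_true, p_hat))
```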

Stephan Kolassa
  • Thank you for your explanation! In that case, could I consider precision@K and recall@K, where K is the number of top-scoring points treated as anomalies? (A code sketch of precision@K follows these comments.) – Dae-Young Park Jun 23 '21 at 08:20
  • Those would still be improper, so I would recommend against them. – Stephan Kolassa Jun 23 '21 at 08:34
  • Why? I remember a paper at a top-tier conference using these evaluation methods, so I want to know why you don't recommend them. – Dae-Young Park Jun 23 '21 at 08:38
  • Take a look at the first link in my answer. The same arguments (especially the ones in my answer in that thread) apply to precision and recall, and similarly to variants like the @K one. ... – Stephan Kolassa Jun 23 '21 at 08:50
  • ... Yes, inappropriate evaluation metrics are used very often, especially in fields that are closer to computer science than to statistics (which I assume you are looking at, because conferences are much more common in CS than in stats). My personal impression is that, sorry, computer scientists and many machine learners simply do not understand the statistical problems in their evaluation metrics, and prefer to use metrics that are "understandable" without thinking deeply enough about them. – Stephan Kolassa Jun 23 '21 at 08:51
  • OK, thank you for your explanation! I will look through the links you gave. – Dae-Young Park Jun 23 '21 at 10:14
  • [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) may also be interesting to you, in particular the discussion in the comments below the question. Also, take a look at Frank Harrell's contributions here and elsewhere. Incidentally, [this is Frank on statistical problems in medicine that peer review by medical experts won't catch](https://twitter.com/f2harrell/status/1401516318818508802?s=20); I would say the exact same applies to computer scientists. Good luck! – Stephan Kolassa Jun 23 '21 at 11:35
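
As a follow-up to the precision@K question in the comments, here is a minimal sketch of those metrics under the definition given there (the K points with the highest anomaly scores are treated as flagged); whether they are appropriate evaluation metrics is exactly what the answer disputes. The helper functions and toy data are hypothetical, for illustration only.

```python
import numpy as np

def precision_at_k(y_true, scores, k):
    """Fraction of true anomalies among the k highest-scoring points."""
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.asarray(y_true)[top_k].mean())

def recall_at_k(y_true, scores, k):
    """Fraction of all true anomalies that appear among the k highest-scoring points."""
    y_true = np.asarray(y_true)
    top_k = np.argsort(scores)[::-1][:k]
    return float(y_true[top_k].sum() / y_true.sum())

# Toy example: 10 points, 2 anomalies, K set to the number of true anomalies.
y_true = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.4, 0.7, 0.1, 0.2]
print(precision_at_k(y_true, scores, k=2))  # 1.0 -- both of the top-2 points are anomalies
print(recall_at_k(y_true, scores, k=2))     # 1.0 -- both anomalies are in the top 2
```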