
I am trying to use TensorFlow to predict a decision based on a time-series dataset.

I have three classes: "Wait", "Forwards", "Backwards".

The dataset is highly imbalanced: ~90% of the time the decision is "Wait", so using accuracy as a metric is not useful.

I need my model to focus on correctly identifying patterns that are either "Forwards" or "Backwards", so I have implemented the following metrics to look at the precision and recall of the classes I deem relevant:

    metrics = [
        tf.keras.metrics.Recall(class_id=1, name='Bkwd_R'),
        tf.keras.metrics.Recall(class_id=2, name='Fwd_R'),
        tf.keras.metrics.Precision(class_id=1, name='Bkwd_P'),
        tf.keras.metrics.Precision(class_id=2, name='Fwd_P'),
    ]
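
For context, here is a minimal sketch of how this list might be wired up; the model, optimizer, and input shape below are placeholders, not from the question. Note that `class_id` indexes into the model's output vector, so this assumes a 3-unit softmax output and one-hot labels in the order [Wait, Backwards, Forwards]:

    import tensorflow as tf

    timesteps, n_features = 50, 8  # hypothetical input shape

    # Placeholder model: any architecture ending in a 3-class softmax will do.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, n_features)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(3, activation='softmax'),
    ])

    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',  # expects one-hot labels, matching class_id
        metrics=metrics,
    )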

My understanding is that these are calculated per class:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

I know the formula for F1, but I don't really understand what it represents, so I am not sure whether I should use it:

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
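
For what it's worth, F1 is the harmonic mean of precision and recall, so it is only high when both are high; it says nothing about which of the two is dragging it down. A quick numerical illustration of all three formulas for a single class, with made-up counts:

    # Hypothetical counts for one class (say "Backwards") on some test set.
    tp, fp, fn = 30, 10, 20

    precision = tp / (tp + fp)  # 30 / 40 = 0.75
    recall = tp / (tp + fn)     # 30 / 50 = 0.60

    # F1 is the harmonic mean of the two, ~0.667 here.
    f1 = 2 * (recall * precision) / (recall + precision)
    print(precision, recall, f1)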

Or should I be using some other type of metric?

For my predictions, the focus is to correctly identify "Forwards" or "Backwards" amongst the noise of "Wait"s.

It would be costly to incorrectly identify "Backwards" as "Forwards" or vice versa, but not so costly to have either identified as "Wait", or to have "Wait" identified as either of the other two.
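
One way to make that cost structure explicit (purely illustrative; the cost values below are made up) is to choose the predicted class by minimising expected cost over the model's class probabilities, rather than simply taking the most probable class:

    import numpy as np

    # Hypothetical cost matrix: rows = true class, columns = predicted class,
    # in the order [Wait, Backwards, Forwards]. Cross-direction mistakes
    # ("Backwards" as "Forwards" or vice versa) are priced much higher.
    cost = np.array([
        [0.0,  1.0,  1.0],
        [1.0,  0.0, 10.0],
        [1.0, 10.0,  0.0],
    ])

    # Predicted class probabilities for a few samples (e.g. from model.predict).
    proba = np.array([
        [0.70, 0.20, 0.10],
        [0.40, 0.35, 0.25],
    ])

    # Expected cost of each possible decision; pick the cheapest one.
    expected_cost = proba @ cost
    decisions = expected_cost.argmin(axis=1)
    print(decisions)  # both samples resolve to "Wait" under these costs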

  • I believe (hope) Dave will chime in soon about proper scoring rules and how class imbalance isn't a problem. If not, you can still search for these topics on the site. – Arya McCarthy Jul 30 '21 at 12:20
  • @AryaMcCarthy - fear not. When Dave is off, I take up the slack. We provide a round-the-clock drumbeat for better practices. – Stephan Kolassa Jul 30 '21 at 13:19
  • Don't use accuracy, precision, recall, sensitivity, specificity, or the F1 score. Every criticism at the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) [Is accuracy an improper scoring rule in a binary classification setting?](https://stats.stackexchange.com/q/359909/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Jul 30 '21 at 13:19
  • Instead, use probabilistic classifications, and evaluate these using [proper scoring rules](https://stats.stackexchange.com/tags/scoring-rules/info). On class balance, see [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Jul 30 '21 at 13:19
  • That was some heavy reading for me, coming from zero experience in this field, and I don't know if I have fully understood the recommendations. **1)** From what I understand, the use of accuracy is wrong, which I believe is the same reason I originally decided on recall and precision for only the two minority classes, but I don't think I grasped why these are also invalid vs. proper scoring rules. **2)** Am I right that you are specifically saying to use "**proper** scoring rules" as opposed to just "scoring rules"? **3)** Given the last paragraph in my question, is there a recommended scoring rule, or how can I find out? – Panda Aug 02 '21 at 10:25
  • [Frank Harrell describes the log-loss as the gold standard](https://www.fharrell.com/post/class-damage/), due to its relationship to maximum likelihood estimation. Other strictly proper scoring rules exist, but it might help if you can explain why that might not work for your task. // Regarding the comments last week, as much as I appreciate it, I have to laugh at me being mentioned as the authority on this, since [I learned about proper scoring rules on here from @StephanKolassa](https://stats.stackexchange.com/a/493930/247274) (among a few others). – Dave Aug 03 '21 at 16:50
  • Yes, the "proper" is crucial. Any mapping that maps a probabilistic prediction and an actual outcome to a score is a *scoring rule*, but a *proper* scoring rule is one that is optimized (in expectation) by the true density. So you really want to use proper scoring rules. [See the tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info), which you may already have read. – Stephan Kolassa Aug 03 '21 at 21:06
  • As to which rule to choose, [Why is LogLoss preferred over other proper scoring rules?](https://stats.stackexchange.com/q/274088/1352) specifically presents arguments for and against the log and the Brier score. It also contains pointers to literature on how to choose a scoring rule. I personally like the log score, because it hits you on the head *hard* if something "impossible" occurs. That is, if you see an outcome you assigned a probability of zero to, the log score will be infinite. I consider this a Good Thing; others feel it's a bug. – Stephan Kolassa Aug 03 '21 at 21:07 (a small numeric sketch of both scores follows these comments)
  • Finally, yes, precision and recall are also improper. Actually, they are not scoring rules at all. [This earlier thread](https://stats.stackexchange.com/a/359936/1352) is about accuracy, but the argument applies to precision and recall as well. If you want, you could open a question here on CV to ask for a deeper explanation, and possibly link it here. I would love to promise I'll answer, but I'm really starved for CV time right now - sorry. But there *are* other people out there. Like @Dave. – Stephan Kolassa Aug 03 '21 at 21:10
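
Following the comments above, here is a small sketch (with made-up probabilities) of the two proper scoring rules mentioned there, the multiclass log loss and the Brier score:

    import numpy as np

    # Made-up predicted probabilities for three samples, order [Wait, Bkwd, Fwd].
    proba = np.array([
        [0.80, 0.15, 0.05],
        [0.10, 0.70, 0.20],
        [0.30, 0.30, 0.40],
    ])
    y_true = np.array([0, 1, 2])  # true class indices
    onehot = np.eye(3)[y_true]    # one-hot encoding of the outcomes

    # Log loss: mean negative log-probability assigned to the true class.
    # It is infinite if the true class ever receives probability zero.
    log_loss = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

    # Multiclass Brier score: mean squared distance between the predicted
    # probability vector and the one-hot encoded outcome.
    brier = np.mean(np.sum((proba - onehot) ** 2, axis=1))

    print(log_loss, brier)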

0 Answers