
Suppose a hospital wants to use a statistical classification model to predict what kind of surgery will be required based on some measured covariates (e.g. height, weight, age, blood pressure). Suppose there are 5 types of surgeries, listed in order of increasing severity (i.e. the classes are ordinal):

  • Local Anesthesia (same day release)
  • Full Anesthesia (same day release)
  • Overnight Stay
  • 48 Hours Monitoring
  • Long Term Monitoring

Suppose the researchers have access to historical data and decide to fit a multi-class classification model (e.g. a random forest) to these data. Now, however, they are interested in studying the misclassification rates of this model. In particular, they want to know: "How wrong is the model when it makes a mistake?"

For example:

  • Case 1: If the patient actually required a Local Anesthesia Surgery and the model predicted Overnight Stay

vs.

  • Case 2: If the patient actually required a Local Anesthesia Surgery and the model predicted Long Term Monitoring

Even though the model's prediction was incorrect in both cases, the prediction in Case 1 was closer to the truth than in Case 2: in Case 1, the model was off by 2 levels, whereas in Case 2 the model was off by 4 levels.

My Question: Although it would be relatively straightforward to build a variant of a confusion matrix that shows how severe the misclassifications are, given the model's predictions, are there any common metrics that can be used to study this? Is this a common modelling practice?

Thanks!


1 Answer


I would definitely not use confusion matrices, misclassification rates, precision, recall or similar metrics, for all the reasons explained in the relevant threads on this site, all of which apply equally to ordinal multiclass problems.

Instead, I would recommend creating probabilistic predictions. If a patient has a 0.9 probability of only requiring Local Anesthesia (and the other possibilities share the remaining probability of 0.1), then our preparations should be very different than if the Local Anesthesia probability were 0.21 (with the other cases again sharing the remaining probability equally), even though in both cases Local Anesthesia has the highest probability, so both cases would yield exactly the same entry in a confusion matrix. Correctly calibrated probabilistic predictions will be far more useful here (compare the threads referenced above), and proper scoring rules are precisely the tools that reward such well-calibrated probabilistic classifications.
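
To make this concrete, here is a minimal sketch (assuming scikit-learn; the synthetic data and model settings are illustrative stand-ins, not taken from the question): fit a random forest, keep the full predicted probability vectors rather than just the most likely class, and evaluate them with a proper scoring rule such as the log score.

```python
# Minimal sketch (illustrative, not part of the original question): probabilistic
# predictions from a random forest, evaluated with the log score (log loss).
# The data below are synthetic stand-ins for the hospital covariates.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))                   # e.g. height, weight, age, blood pressure
severity = X @ np.array([1.0, 0.5, 0.8, 0.3]) + rng.normal(scale=0.5, size=1000)
y = np.digitize(severity, np.quantile(severity, [0.2, 0.4, 0.6, 0.8]))  # 5 ordered classes, coded 0..4

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
proba = model.predict_proba(X_test)              # one probability column per class

print("log score (lower is better):", log_loss(y_test, proba, labels=model.classes_))
```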

Now for the semi-bad news: although there is no dearth of proper scoring rules for multiclass predictions (I personally like the log score best: Why is LogLoss preferred over other proper scoring rules?), there is apparently precisely zero work on proper scoring rules for ordinal classification. (Then again, confusion matrices and associated KPIs, per above, also do not take the ordinal relationship into account.) On the one hand, as far as I can tell, that is completely fine, because the math for the categorical case carries over to the ordinal case, so any proper scoring rule for categorical predictions stays proper for ordinal ones, and you can use the log or Brier score "as is". On the other hand, it feels like the ordinality is something we should leverage. If the true outcome is Local Anesthesia, then a probabilistic prediction of $\hat{p}_1 = (0.7,0,0,0,0.3)$ will contribute exactly the same amount to the log score and to the Brier score as $\hat{p}_2 = (0.7,0.3,0,0,0)$, although $\hat{p}_2$ is of course a much better prediction than $\hat{p}_1$ - no matter that our proper scoring rules will draw us towards correctly calibrated predictions in expectation.
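
A quick numerical check of this point (my own sketch, not part of the original answer): for a true outcome of Local Anesthesia, the two predictions above receive identical log scores and identical Brier scores.

```python
# Quick check (illustrative code): both example predictions get identical log
# and Brier scores when the true class is Local Anesthesia (the first class).
import numpy as np

o = np.array([1, 0, 0, 0, 0])                   # one-hot truth: Local Anesthesia
p_hat_1 = np.array([0.7, 0.0, 0.0, 0.0, 0.3])
p_hat_2 = np.array([0.7, 0.3, 0.0, 0.0, 0.0])

for p_hat in (p_hat_1, p_hat_2):
    log_score = -np.log(p_hat[0])               # log score: minus log probability of the realized class
    brier = np.sum((p_hat - o) ** 2)            # multiclass Brier score for this single instance
    print(f"log score: {log_score:.4f}, Brier score: {brier:.2f}")
# Output: identical scores (0.3567 and 0.18) for both predictions.
```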

A wrong idea

One possibility might be to use a modification of the multiclass Brier score, which is

$$B=\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{R} (\hat{p}_{ti} - o_{ti})^2$$

for $N$ outcomes with $R$ possible classes, where $\hat{p}_{ti}$ is the probabilistic prediction for instance $t$ to be of class $i$, and $o_{ti}$ is an indicator variable that is $1$ if instance $t$ is of class $i$ and $0$ otherwise. We could include the distance between the predicted and the actual class (call this $a_t\in\{1, \dots, R\}$, where we assume classes to be ordered $1<\dots<R$) as follows:

$$\tilde{B}=\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{R} |a_t-i|\cdot(\hat{p}_{ti} - o_{ti})^2.$$

This penalizes a high probabilistic prediction (high $\hat{p}_{ti}$) for a wrong class ($a_t\neq i$) more strongly if the wrong class is farther away from the true class (large $|a_t-i|$). Of course, predictions for the correct class now receive a weight of $|a_t-i|=0$, so we no longer reward a high correct $\hat{p}_{ti}$ - but that should be fine, since a high correct $\hat{p}_{ti}$ necessarily implies low incorrect $\hat{p}_{ti}$. Overall, it intuitively seems to me that this "distance-weighted Brier score" should still be proper and should address the issues with the two examples in the previous paragraph (but I emphasize that I do not have a formal proof). Incidentally, note that we can use other distances between our ordered classes.
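
As an illustrative sketch (my own code, with classes assumed to be coded $1,\dots,R$ in severity order), here is how the standard and the "distance-weighted" Brier scores compare on the two example predictions from above:

```python
# Sketch of the proposed "distance-weighted" Brier score (illustrative code, not a
# vetted implementation). Classes are assumed to be coded 1..R in severity order.
import numpy as np

def brier(p_hat, a, R):
    """Standard multiclass Brier score for a single instance with true class a."""
    o = np.eye(R)[a - 1]                        # one-hot encoding of the true class
    return np.sum((p_hat - o) ** 2)

def weighted_brier(p_hat, a, R):
    """Distance-weighted variant: each term is weighted by |a - i|."""
    o = np.eye(R)[a - 1]
    w = np.abs(a - np.arange(1, R + 1))         # distance of each class from the true class
    return np.sum(w * (p_hat - o) ** 2)

p_hat_1 = np.array([0.7, 0.0, 0.0, 0.0, 0.3])
p_hat_2 = np.array([0.7, 0.3, 0.0, 0.0, 0.0])

print(brier(p_hat_1, 1, 5), brier(p_hat_2, 1, 5))                    # 0.18 and 0.18: no difference
print(weighted_brier(p_hat_1, 1, 5), weighted_brier(p_hat_2, 1, 5))  # 0.36 vs. 0.09: p_hat_2 preferred
```

The weighted variant does distinguish $\hat{p}_1$ from $\hat{p}_2$, which looks encouraging - but see the edit below.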

EDIT: It turns out this idea does not work. Specifically, this "distance-weighted" Brier score is no longer proper, let alone strictly proper. As a counterexample, consider $R=4$ possible classes ordered $1<2<3<4$, and assume the true probabilities are $p=(0.5,0.3,0.2,0)$. We compare the calibrated probabilistic prediction $\hat{p}=p$ and a miscalibrated prediction $\hat{p}'=(0.5,0.3,0.1,0.1)$. Their expected "distance-weighted" Brier scores are:

$$\begin{align*} E\tilde{B}(\hat{p}) =&\; 0.5\cdot(1\cdot 0.3^2+2\cdot 0.2^2) \;+ \\ &\; 0.3\cdot(1\cdot 0.5^2+1\cdot 0.2^2) \;+ \\ &\; 0.2\cdot(2\cdot 0.5^2+1\cdot 0.3^2) \\ =&\; 0.290 \end{align*}$$

and

$$\begin{align*} E\tilde{B}(\hat{p}') =&\; 0.5\cdot(1\cdot 0.3^2+2\cdot 0.1^2+3\cdot 0.1^2) \;+ \\ &\; 0.3\cdot(1\cdot 0.5^2+1\cdot 0.1^2+2\cdot 0.1^2) \;+ \\ &\; 0.2\cdot(2\cdot 0.5^2+1\cdot 0.3^2+1\cdot 0.1^2) \\ =&\; 0.274. \end{align*}$$

That is, the "distance-weighted" Brier score of the miscalibrated prediction $\hat{p}'$ is smaller in expectation than for the calibrated prediction $\hat{p}=p$. The "distance-weighted" Brier score is thus not proper.
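
These expectations can be verified numerically; the following is my own check (not part of the original answer), looping over the possible true classes and weighting each instance-level score by its probability under $p$:

```python
# Numerical check of the counterexample above (illustrative sketch): expected
# "distance-weighted" Brier scores under the true distribution p, for R = 4 classes.
import numpy as np

R = 4
p = np.array([0.5, 0.3, 0.2, 0.0])              # true class probabilities
p_hat = p                                       # calibrated prediction
p_hat_prime = np.array([0.5, 0.3, 0.1, 0.1])    # miscalibrated prediction

def expected_weighted_brier(q, p_true):
    """Average the distance-weighted Brier score over the possible true classes."""
    total = 0.0
    for a in range(1, R + 1):
        o = np.eye(R)[a - 1]                    # one-hot encoding of true class a
        w = np.abs(a - np.arange(1, R + 1))     # distances |a - i|
        total += p_true[a - 1] * np.sum(w * (q - o) ** 2)
    return total

print(expected_weighted_brier(p_hat, p))        # ≈ 0.290
print(expected_weighted_brier(p_hat_prime, p))  # ≈ 0.274, i.e. smaller despite being miscalibrated
```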

I apologize for any confusion. It seems like using the "standard" Brier score is still our best bet, even for ordered data.

Stephan Kolassa
    Stephan's answer is definitive. Big picture: just because something is measured in categories does not mean that we should categorize future observations. Instead we quantify _tendencies_ to be in categories. – Frank Harrell Dec 18 '21 at 13:39