I would definitely not use confusion matrices, misclassification rates, precision, recall, or similar metrics, for all the reasons explained in the relevant threads; those reasons apply equally to ordinal multiclass problems.
Instead, I would recommend creating probabilistic predictions. If a patient has a 0.9 probability of only requiring Local Anesthesia (with the other possibilities sharing the remaining probability of 0.1), then our preparations should be very different than if the Local Anesthesia probability were only 0.21 (with the other cases again sharing the remaining probability equally) - even though in both cases Local Anesthesia has the highest probability, so both would yield the exact same entry in a confusion matrix. Correctly calibrated probabilistic predictions will be far more useful here, and proper scoring rules are precisely the tools that reward such well-calibrated probabilistic classifications.
Now for the semi-bad news: while there is no dearth of proper scoring rules for multiclass predictions (I personally like the log score best: Why is LogLoss preferred over other proper scoring rules?), there is apparently precisely zero work on proper scoring rules for ordinal classification. (Then again, confusion matrices and their associated KPIs, per above, do not take the ordinal relationship into account either.) On the one hand, as far as I can tell, that is completely fine: the math for the categorical case carries over to the ordinal case, so any proper scoring rule for categorical predictions remains proper for ordinal ones, and you can use the log or Brier score "as-is". On the other hand, it feels like we should leverage the ordinality somehow. If the true outcome is Local Anesthesia, then a probabilistic prediction of $\hat{p}_1 = (0.7,0,0,0,0.3)$ contributes the exact same amount to the log score and to the Brier score as $\hat{p}_2 = (0.7,0.3,0,0,0)$, although $\hat{p}_2$ is of course a much better prediction than $\hat{p}_1$ - no matter that proper scoring rules will still draw us towards correctly calibrated predictions in expectation.
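For concreteness, here is a small numerical check (a sketch in Python/NumPy; the helper names `log_score` and `brier_score` are my own, not library functions) confirming that both scores assign identical values to $\hat{p}_1$ and $\hat{p}_2$ when the true outcome is the first class:

```python
import numpy as np

def log_score(p_hat, true_class):
    """Negative log probability assigned to the realized class (lower is better)."""
    return -np.log(p_hat[true_class])

def brier_score(p_hat, true_class):
    """Multiclass Brier score for a single observation (lower is better)."""
    outcome = np.zeros_like(p_hat)
    outcome[true_class] = 1.0
    return np.sum((p_hat - outcome) ** 2)

p_hat_1 = np.array([0.7, 0.0, 0.0, 0.0, 0.3])  # misplaces 0.3 on the farthest class
p_hat_2 = np.array([0.7, 0.3, 0.0, 0.0, 0.0])  # misplaces 0.3 on the adjacent class

for score in (log_score, brier_score):
    print(score.__name__, score(p_hat_1, 0), score(p_hat_2, 0))
# log_score   0.3567 0.3567  (approximately)
# brier_score 0.18   0.18
```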
A wrong idea
One possibility might be to use a modification of the multiclass Brier score, which is
$$B=\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{R} (\hat{p}_{ti} - o_{ti})^2$$
for $N$ instances with $R$ possible classes, where $\hat{p}_{ti}$ is the probabilistic prediction for instance $t$ to be of class $i$, and $o_{ti}$ is an indicator variable that is $1$ if instance $t$ is of class $i$ and $0$ otherwise. We could weight each term by the distance between class $i$ and the actual class of instance $t$ (call the actual class $a_t\in\{1, \dots, R\}$, where we assume classes to be ordered $1<\dots<R$) as follows:
$$\tilde{B}=\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{R} |a_t-i|\cdot(\hat{p}_{ti} - o_{ti})^2.$$
This penalizes a high probabilistic prediction (high $\hat{p}_{ti}$) for a wrong class ($a_t\neq i$) more strongly if that wrong class is farther away from the true class (large $|a_t-i|$). Of course, predictions for the correct class now receive a weight of $|a_t-i|=0$, so we no longer reward a high correct $\hat{p}_{ti}$ - but that should be fine, since a high correct $\hat{p}_{ti}$ necessarily implies low incorrect $\hat{p}_{ti}$. Overall, it intuitively seems to me that this "distance-weighted Brier score" should still be proper, and that it should address the issue with the two example predictions in the previous paragraph (but I emphasize that I do not have a formal proof). Incidentally, note that we could use other distances between our ordered classes.
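To make the proposal concrete, here is a minimal sketch of this distance-weighted score (the helper name `distance_weighted_brier` is mine; classes are 0-indexed in the code, so the distances $|a_t-i|$ become differences of array positions):

```python
import numpy as np

def distance_weighted_brier(p_hat, true_class):
    """Brier score with each squared error weighted by the class distance |a_t - i|."""
    classes = np.arange(len(p_hat))
    outcome = (classes == true_class).astype(float)
    weights = np.abs(classes - true_class)
    return np.sum(weights * (p_hat - outcome) ** 2)

p_hat_1 = np.array([0.7, 0.0, 0.0, 0.0, 0.3])
p_hat_2 = np.array([0.7, 0.3, 0.0, 0.0, 0.0])

print(distance_weighted_brier(p_hat_1, 0))  # 4 * 0.3^2 = 0.36
print(distance_weighted_brier(p_hat_2, 0))  # 1 * 0.3^2 = 0.09
```

On the two example predictions from above, the weighted score now distinguishes them: $\hat{p}_1$ is penalized four times as heavily as $\hat{p}_2$, precisely because its misplaced probability mass sits four classes away from the truth instead of one.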
EDIT: it turns out this idea does not work. Specifically, this "distance-weighted" Brier score is no longer proper, let alone strictly proper. As a counterexample, consider $R=4$ possible classes ordered $1<2<3<4$, and assume the true probabilities are $p=(0.5,0.3,0.2,0)$. We compare the calibrated probabilistic prediction $\hat{p}=p$ with the miscalibrated prediction $\hat{p}'=(0.5,0.3,0.1,0.1)$. Their expected "distance-weighted" Brier scores are:
$$ \begin{align*}
E\tilde{B}(\hat{p}) =&\; 0.5\cdot(1\cdot 0.3^2+2\cdot 0.2^2) \;+ \\
&\; 0.3\cdot(1\cdot 0.5^2+1\cdot 0.2^2) \;+ \\
&\; 0.2\cdot(2\cdot 0.5^2+1\cdot0.3^2) \\
=&\; 0.290
\end{align*} $$
and
$$ \begin{align*}
E\tilde{B}(\hat{p}') =&\; 0.5\cdot(1\cdot 0.3^2+2\cdot 0.1^2+3\cdot 0.1^2) \;+ \\
&\; 0.3\cdot(1\cdot 0.5^2+1\cdot 0.1^2+2\cdot 0.1^2) \;+ \\
&\; 0.2\cdot(2\cdot 0.5^2+1\cdot 0.3^2+ 1\cdot 0.1^2) \\
=&\; 0.274.
\end{align*} $$
That is, the expected "distance-weighted" Brier score of the miscalibrated prediction $\hat{p}'$ is smaller than that of the calibrated prediction $\hat{p}=p$. The "distance-weighted" Brier score is thus not proper.
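The counterexample is easy to verify numerically; here is a short check (again a sketch, re-using the same assumed `distance_weighted_brier` helper as above):

```python
import numpy as np

def distance_weighted_brier(p_hat, true_class):
    """Distance-weighted Brier score for one observation (0-indexed classes)."""
    classes = np.arange(len(p_hat))
    outcome = (classes == true_class).astype(float)
    return np.sum(np.abs(classes - true_class) * (p_hat - outcome) ** 2)

def expected_score(p_true, p_hat):
    """Expected distance-weighted Brier score when the class is drawn from p_true."""
    return sum(p_true[a] * distance_weighted_brier(p_hat, a)
               for a in range(len(p_true)))

p         = np.array([0.5, 0.3, 0.2, 0.0])  # true class probabilities
p_hat     = p                                # calibrated prediction
p_hat_mis = np.array([0.5, 0.3, 0.1, 0.1])   # miscalibrated prediction

print(expected_score(p, p_hat))      # 0.290
print(expected_score(p, p_hat_mis))  # 0.274 < 0.290, so the rule is not proper
```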
I apologize for any confusion. It seems like using the "standard" Brier score is still our best bet, even for ordered data.