Your first point looks at the "raw data" behind quality-of-prediction measurements (the confusion matrix in its tabulated form). There are more options than just misclassification counts:
- Predictions for different kinds (groups) of cases: training error vs. independent test cases, ...
- The confusion-matrix based errors (there are also measures here that take into account the composition of the data set for which the predictions are computed) are also related to prediction error measures known from regression, such as mean absolute error and root mean squared error (see the sketch after this list).
[further reading]
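To make the link to regression-type error measures concrete, here is a minimal sketch (plain numpy, made-up 0/1 labels and hard predictions) showing that the observed misclassification rate coincides with the mean absolute error and the mean squared error of 0/1 coded predictions:

```python
import numpy as np

# hypothetical hard (0/1) reference labels and predictions
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])

error_rate = np.mean(y_true != y_pred)        # fraction misclassified
mae = np.mean(np.abs(y_true - y_pred))        # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)         # mean squared error

print(error_rate, mae, mse)                   # all three are 0.3 here
```

With probabilistic (soft) predictions the same formulas generalize; the mean squared error of predicted class probabilities is the Brier score.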
So I regard the confusion matrix as a summary of an observation (experiment). Having observed this set of predictions, you can start to work with statistical tests, just as you can with your "actual" data or with the model (parameters).
You may get away with reporting e.g. the observed proportion of misclassifications (and the number of test cases). However, IMHO you should really give confidence intervals as well (see the sketch below). And that brings you to the point that if you want e.g. to compare two models, you need a statistical test rather than just concluding something from the fact that you observed more misclassifications for one model than for the other.
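As a sketch of the confidence-interval point (hypothetical counts; I use proportion_confint from statsmodels here, but any binomial proportion interval will do):

```python
from statsmodels.stats.proportion import proportion_confint

n_test = 100          # number of independent test cases (made up)
n_wrong = 12          # observed misclassifications (made up)

# 95 % Wilson score interval for the true misclassification probability
lower, upper = proportion_confint(n_wrong, n_test, alpha=0.05, method="wilson")
print(f"observed error rate {n_wrong / n_test:.2f}, "
      f"95 % CI [{lower:.3f}, {upper:.3f}]")
```

With only 100 test cases this interval is rather wide, which is exactly why reporting the observed proportion alone can be misleading.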
You can do statistical tests on confusion matrices, e.g. the McNemar test for comparing predictions of two models with a paired design.
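A minimal sketch of such a paired comparison, using mcnemar from statsmodels (the 2×2 table of correct/wrong counts for the two models on the same test cases is made up):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# rows: model A correct / wrong, columns: model B correct / wrong
# (hypothetical counts for the *same* test cases)
table = np.array([[70,  5],
                  [15, 10]])

# only the discordant cells (5 vs. 15) tell us which model
# misclassifies more of the shared test cases
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```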
Deviance is related to mean squared error (see e.g. Gelman: Bayesian Data Analysis).
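To make that relationship a bit more explicit (using the definition $D(y, \theta) = -2 \log p(y \mid \theta)$ from BDA): for a Gaussian likelihood with known variance $\sigma^2$,

$$D(y, \hat y) = -2 \log p(y \mid \hat y) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \hat y_i)^2 + n \log(2 \pi \sigma^2),$$

so up to a constant the deviance is $n/\sigma^2$ times the mean squared error, and minimizing one minimizes the other.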
In contrast to the other measures, Z-scores are about the model itself (e.g. its parameters) rather than about its predictions.
Of course, you may calculate e.g. confidence intervals for model parameters as well as for predictions.
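As a sketch of that last point with a logistic GLM in statsmodels (made-up data; conf_int() gives intervals for the coefficients, get_prediction() for the predicted probabilities):

```python
import numpy as np
import statsmodels.api as sm

# hypothetical binary outcome depending on one predictor
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

print(fit.conf_int())                       # 95 % CIs for the model parameters

new_X = sm.add_constant(np.array([-1.0, 0.0, 1.0]))
pred = fit.get_prediction(new_X)
print(pred.conf_int())                      # 95 % CIs for the predicted probabilities
```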