Your first point looks at the "raw data" behind quality-of-prediction measurements (the confusion matrix in its tabulated form). There are more options than just misclassification counts:
- Predictions for different kinds (groups) of cases: training error vs. independent test cases, ...
- The confusion-matrix based errors (there are also measures here that take into account the composition of the data set for which the predictions are computed) are also related to prediction error measures known from regression, such as mean absolute error and root mean squared error (see the sketch after this list).
[further reading]
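To make the link to regression-type error measures concrete, here is a minimal sketch (plain numpy, made-up 0/1 labels and hard predictions) showing that the observed misclassification rate coincides with the mean absolute error and the mean squared error of 0/1 coded predictions:

```python
import numpy as np

# hypothetical hard (0/1) reference labels and predictions
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0, 1, 0])

error_rate = np.mean(y_true != y_pred)        # fraction misclassified
mae = np.mean(np.abs(y_true - y_pred))        # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)         # mean squared error

print(error_rate, mae, mse)                   # all three are 0.3 here
```

With probabilistic (soft) predictions the same formulas generalize; the mean squared error of predicted class probabilities is the Brier score.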
So I regard the confusion matrix as a summary of an observation (experiment). Having observed this set of predictions, you can start to work with statistical tests, just as you can with your "actual" data or with the model (parameters).
You may get away with reporting e.g. the observed proportion of misclassifications (and the number of test cases). However, IMHO you should really give confidence intervals as well (see the sketch below). And that brings you to the point that if you want e.g. to compare two models, you need a statistical test rather than just concluding something from the fact that you observed more misclassifications for one model than for the other.
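As a sketch of the confidence-interval point (hypothetical counts; I use proportion_confint from statsmodels here, but any binomial proportion interval will do):

```python
from statsmodels.stats.proportion import proportion_confint

n_test = 100          # number of independent test cases (made up)
n_wrong = 12          # observed misclassifications (made up)

# 95 % Wilson score interval for the true misclassification probability
lower, upper = proportion_confint(n_wrong, n_test, alpha=0.05, method="wilson")
print(f"observed error rate {n_wrong / n_test:.2f}, "
      f"95 % CI [{lower:.3f}, {upper:.3f}]")
```

With only 100 test cases this interval is rather wide, which is exactly why reporting the observed proportion alone can be misleading.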
You can do statistical tests on confusion matrices, e.g. the McNemar test for comparing predictions of two models with a paired design.
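A minimal sketch of such a paired comparison, using mcnemar from statsmodels (the 2×2 table of correct/wrong counts for the two models on the same test cases is made up):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# rows: model A correct / wrong, columns: model B correct / wrong
# (hypothetical counts for the *same* test cases)
table = np.array([[70,  5],
                  [15, 10]])

# only the discordant cells (5 vs. 15) tell us which model
# misclassifies more of the shared test cases
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```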
Deviance is related to mean squared error (see e.g. Gelman: Bayesian Data Analysis).
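To make that relationship a bit more explicit (using the definition $D(y, \theta) = -2 \log p(y \mid \theta)$ from BDA): for a Gaussian likelihood with known variance $\sigma^2$,

$$D(y, \hat y) = -2 \log p(y \mid \hat y) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \hat y_i)^2 + n \log(2 \pi \sigma^2),$$

so up to a constant the deviance is $n/\sigma^2$ times the mean squared error, and minimizing one minimizes the other.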
In contrast to the other measures, Z-scores are about the model itself (e.g. its parameters) rather than about its predictions.
Of course, you may calculate e.g. confidence intervals for model parameters as well as for predictions.
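As a sketch of that last point with a logistic GLM in statsmodels (made-up data; conf_int() gives intervals for the coefficients, get_prediction() for the predicted probabilities):

```python
import numpy as np
import statsmodels.api as sm

# hypothetical binary outcome depending on one predictor
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

print(fit.conf_int())                       # 95 % CIs for the model parameters

new_X = sm.add_constant(np.array([-1.0, 0.0, 1.0]))
pred = fit.get_prediction(new_X)
print(pred.conf_int())                      # 95 % CIs for the predicted probabilities
```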