
I ran a binary logistic regression, with a binary dependent variable and a single continuous independent variable.

Now I want to evaluate the out-of-sample performance of the classification algorithm so obtained. For instance, given a probability threshold, I want to compute out-of-sample accuracy or sensitivity.

One solution would be resampling-based estimators: for instance, the validation-set method. I would split the dataset into a training set and a test set, train the model on the training set, and evaluate the performance (e.g. accuracy or sensitivity) on the test set.

This is a generic approach that can be used with any estimator. For example, I could use the same approach with an SVM or a neural network.
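To make the validation-set approach concrete, here is a minimal sketch in Python using scikit-learn; the simulated data `x`, `y`, the 70/30 split, and the 0.5 threshold are assumptions for illustration only, not part of the question.

```python
# Validation-set estimate of accuracy and sensitivity for a logistic regression
# with one continuous predictor. The data below are simulated for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))                         # continuous predictor
p_true = 1 / (1 + np.exp(-2 * x[:, 0]))               # true P(y = 1 | x)
y = (rng.uniform(size=500) < p_true).astype(int)      # binary outcome

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.3, random_state=0)

model = LogisticRegression().fit(x_tr, y_tr)          # train on the training set
p_hat = model.predict_proba(x_te)[:, 1]               # out-of-sample P(y = 1)

threshold = 0.5                                       # stipulated probability cut-off
y_pred = (p_hat >= threshold).astype(int)

accuracy = np.mean(y_pred == y_te)
sensitivity = np.mean(y_pred[y_te == 1] == 1)         # true-positive rate
print(f"accuracy = {accuracy:.3f}, sensitivity = {sensitivity:.3f}")
```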

But logistic regression differs in many ways from a (deep) neural network: in particular, we know a lot about the properties of the estimator. For instance, we can compute confidence intervals for the estimated coefficients, and we have an estimate of the irreducible error in the data.

My question is: are there any analytical solutions (not re-sampling based like validation set or cross-validation) to compute out-of-sample performance metrics like accuracy or sensitivity?

For instance, with least squares linear or polynomial regression, I could train the model on the entire dataset (without splitting into train/test) and then get the Leave-One-Out Cross-Validation MSE with the following formula: $$ \text{CV} = \dfrac{1}{n} \sum\limits_{i=1}^n \left( \dfrac{y_i - \hat{y}_i}{1-h_i} \right)^2 $$ where $h_i$ is the leverage of the $i$-th observation (the $i$-th diagonal element of the hat matrix).

(see James et al., An Introduction to Statistical Learning, p. 180)
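For context, here is a small numerical check (not from the book) showing that this leverage-based shortcut reproduces the LOOCV MSE obtained by explicitly refitting OLS $n$ times; the simulated `X` and `y` are purely illustrative.

```python
# Check that the leverage formula reproduces the explicit leave-one-out MSE for OLS.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix: intercept + predictor
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]              # OLS fit on all n observations
y_hat = X @ beta
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)            # leverages h_i (diagonal of hat matrix)

cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)      # closed-form LOOCV MSE

# Brute-force comparison: refit with each observation left out in turn.
errs = [(y[i] - X[i] @ np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i), rcond=None)[0]) ** 2
        for i in range(n)]
cv_explicit = np.mean(errs)

print(cv_shortcut, cv_explicit)                          # agree up to floating-point error
```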

Is there a similar formula for the accuracy of logistic regression?

robertspierre
  • How is the linked question an answer to my question? I am asking for a formula to compute accuracy/sensitivity/etc. from the properties of the estimator, not "the bootstrap" – robertspierre Apr 12 '21 at 05:49
  • I suggest you edit your question to clarify that you're looking for formulae - analytical solutions - rather than resampling-based estimators. – Scortchi - Reinstate Monica Apr 12 '21 at 08:27
  • @Scortchi-ReinstateMonica I have modified the question. I hope it is now clear. I have also removed the code which was misleading – robertspierre Apr 12 '21 at 09:30
  • Thanks - I think it's clear now. Might be worth specifying it's *out-of-sample* performance you want to estimate. And that the analytical solution for linear regression is *adjusted* $R^2$. Note also that the threshold on predicted probability of class membership needs to be stipulated for accuracy. – Scortchi - Reinstate Monica Apr 12 '21 at 10:37
  • @Scortchi-ReinstateMonica I forgot! In the case of linear or polynomial regression, we have an analytical solution for the (out-of-sample) LOOCV: a very simple one that allows computing the LOOCV MSE without actually retraining the model. That would be the best comparison I could have. I have modified the question again! – robertspierre Apr 13 '21 at 12:36
  • For linear regression there is adjusted $R^2$; for logistic regression there is an adjusted deviance $R^2$, which does the same for deviance. I think AIC also has some connection to out-of-sample deviance. I don't know anything about similar adjustments for accuracy. – rep_ho Apr 19 '21 at 14:49

0 Answers