I have a set of samples with two labels, red and black. I can build a logistic regression model to predict the label colour. Once the model is built, I would like to test whether or not it is overfitting.
Normally, I would set aside, say, 30% of my sample as an out-of-sample holdout, build a logistic regression model on the 70% development sample, and test its performance (for example, the Gini coefficient) on the out-of-sample portion. If I see the Gini drop from the development sample to the out-of-sample, I know the model may be overfit.
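To make the procedure concrete, here is a minimal sketch of that holdout check in Python with scikit-learn, using synthetic data in place of the real red/black sample (the data, features, and seed are all hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in for the real red/black sample
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# 70% development sample, 30% out-of-sample holdout;
# stratify to keep the red/black ratio the same in both parts
X_dev, X_oos, y_dev, y_oos = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression().fit(X_dev, y_dev)

# Gini coefficient = 2 * AUC - 1
gini_dev = 2 * roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1]) - 1
gini_oos = 2 * roc_auc_score(y_oos, model.predict_proba(X_oos)[:, 1]) - 1

print(f"Development Gini:   {gini_dev:.3f}")
print(f"Out-of-sample Gini: {gini_oos:.3f}")
# A large drop from development to out-of-sample suggests overfitting.
```

The worry in the question is precisely that with few reds or blacks, the 30% held out here is data the model badly needs.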
However, when my sample contains only a small number of reds (or blacks, or both), I am reluctant to set aside an out-of-sample portion for validation. I'd rather use as much data as possible, given the limitation. So what effective validation methods, other than the one I described above, can be used?
Note that here I am not trying to determine the best model form (logistic regression is the one to use), nor to tune any parameters (I don't need to determine the number of variables to use). I simply have a model and want to test whether it is overfitting the training sample or not. So I don't think cross-validation would be applicable here.