5

I am using Logistic Regression in a low event rate situation.
Overall universe: 46,000
Events: 420

The conventional logistic regression workflow divides the data into training and test sets and computes error rates. The final coefficients and a threshold level are chosen and a model is created.

OTOH, I'm just trying to show that a particular coefficient is significant and has a positive association with the event under study. I'm not developing a predictive model as of now. I don't focus on error rates (too many true negatives!), and I would choose my threshold level roughly based on the hit rate.

Should I consider dividing my universe into 2 samples, the conventional way? With such a low event rate, I'm worried that doing this would bias my coefficient estimates.
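For concreteness, a minimal sketch of the kind of fit I have in mind (Python with statsmodels; the data below are simulated placeholders, not my actual universe):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Placeholder data: ~46,000 rows with a roughly 1% event rate.
rng = np.random.default_rng(0)
n = 46_000
X = pd.DataFrame({
    "x1": rng.normal(size=n),   # predictor of interest
    "x2": rng.normal(size=n),   # control variable
})
logit_p = -4.8 + 0.4 * X["x1"]          # low intercept -> rare events
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Fit on the FULL sample (no train/test split) and inspect the
# coefficient, its standard error, and its p-value.
model = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
print(model.summary())          # z-statistics and p-values per coefficient
print(model.conf_int())         # 95% confidence intervals
```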

Maddy
    Why would you use a measure such as error rate that is inconsistent with maximum likelihood estimation? – Frank Harrell Apr 23 '14 at 02:35
  • @FrankHarrell Can you please suggest an alternative? – Maddy Apr 25 '14 at 03:37
    The logistic model is a direct probability model. When estimated probabilities are available, accuracy scores should use the estimated probabilities and not whether the risks exceed some arbitrary threshold. The log-likelihood and measures derived from it (e.g., generalized $R^2$) are the gold standards. Proper scoring rules are also good (e.g., Brier score). – Frank Harrell Apr 25 '14 at 11:56
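To make that comment concrete, a small sketch (Python; scikit-learn's metrics, with simulated `y` and `p_hat` as placeholders) comparing proper scoring rules with a thresholded error rate:

```python
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss, accuracy_score

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.01, size=10_000)                           # rare events, ~1%
p_hat = np.clip(rng.beta(1, 60, size=10_000) + 0.05 * y, 0, 1)   # toy predicted probabilities

# Proper scoring rules: use the estimated probabilities directly.
print("log loss   :", log_loss(y, p_hat))
print("Brier score:", brier_score_loss(y, p_hat))

# Thresholded error rate: dominated by the true negatives, so the
# useless rule "predict nobody has the event" already looks great.
print("accuracy at 0.5 threshold:", accuracy_score(y, p_hat > 0.5))
print("accuracy of all-zero rule:", accuracy_score(y, np.zeros_like(y)))
```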

3 Answers

6

I do not think you need to divide the set if you are interested in the significance of a coefficient and not in prediction. Cross validation is used to judge the prediction error outside the sample used to estimate the model. Typically, the objective will be to tune some parameter that is not being estimated from the data.

For example, if you were interested in prediction, I would advise you to use regularized logistic regression. This is similar to logistic regression, except that the coefficients (as a whole) are biased towards 0. The amount of bias is determined by a penalty parameter that is typically fine-tuned via cross validation. The idea is to choose the penalty parameter that minimizes the out-of-sample error (which is measured via cross validation). When building a predictive model, it is acceptable (and desirable) to introduce some bias into the coefficients if that bias causes a much larger drop in the variance of the prediction (hence resulting in a better model for predictive purposes).
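As a rough illustration (scikit-learn is assumed; the simulated `X` and `y` are placeholders): `LogisticRegressionCV` picks the penalty strength by cross validation, and the resulting coefficients are shrunk toward 0 relative to a nearly unpenalized fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

rng = np.random.default_rng(2)
X = rng.normal(size=(5_000, 10))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 4))))   # rare events, a few percent

# Nearly unpenalized fit (very large C ~ plain maximum likelihood).
mle = LogisticRegression(C=1e6, max_iter=2000).fit(X, y)

# Penalized fit: the penalty strength C is chosen by 5-fold cross
# validation to minimize out-of-sample log loss.
cv_fit = LogisticRegressionCV(Cs=10, cv=5, scoring="neg_log_loss",
                              max_iter=2000).fit(X, y)

print("near-MLE coefficients :", mle.coef_.round(3))
print("penalized coefficients:", cv_fit.coef_.round(3))   # shrunk toward 0
print("chosen penalty C      :", cv_fit.C_)
```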

What you are trying to do is inference. You want an unbiased estimate of a coefficient (presumably to judge the effect that changing one variable may have on another). The best way to obtain this is to have a well-specified model and a sample as large as possible. Hence, I would not split the sample. If you are interested in sampling variation, you should try a bootstrap or a jackknife procedure instead.
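For instance, a rough bootstrap sketch (Python with statsmodels; the simulated `X` and `y` are placeholders for your own data) for judging the sampling variation of a coefficient without splitting the sample:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5_000
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 1] - 3))))

# Resample cases with replacement, refit, and collect the coefficient
# of interest (here: the first predictor, column index 1).
boot_coefs = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    fit = sm.Logit(y[idx], X[idx]).fit(disp=False)
    boot_coefs.append(fit.params[1])

boot_coefs = np.array(boot_coefs)
print("bootstrap mean  :", boot_coefs.mean())
print("bootstrap 95% CI:", np.percentile(boot_coefs, [2.5, 97.5]))
```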

EDIT:

Short version: You want an unbiased model. Cross validation can help you find a good predictive model, and good predictive models are often biased. Hence, I do not think cross validation is helpful in this situation.

Bruno
  • Could you describe a bit more the difference between prediction and significance? IMO, they are very similar, and you can use the p<0.05 values to predict other results. The benefit of splitting the dataset is to check the predictions you obtained with one part of the dataset against the other (cross-validate), and it is particularly useful for large samples. Thanks – Luis Jan 23 '20 at 18:58
    @Luis: Significance is tied to a p-value. It is ultimately a (frequentist) statistical test. You can use a model (like linear regression) that has the statistical machinery required for statements about significance to do prediction. Algorithms better at prediction, like gradient boosting machines, do not generally have notions of significance, but they produce better predictive results. The trade-off is that the statistical models are generally more transparent, and they allow you to do statistical inference more directly. – Bruno Jan 27 '20 at 03:02
1

(1) Split sample is likely not the conventional way to approach this problem. Obviously conventions differ by field of research and subject area. But I don't think it is unreasonable to say that bootstrapping for optimism would be the standard here, and I think you would have to justify in some detail any alternative methods you were planning on using.
(2) You're right, you most probably don't need to validate the model if you're only planning on looking at the association/coefficients. But you should know that the coefficients (and their p-values) are only valid for a pre-specified model. If you've included splines, variable selection, etc., these values are inflated and might well have limited meaning. The validation process attempts to estimate the over-fitting of the model - the degree of optimism. It validates the model-building process, not the model. If there is no model building - only a pre-specified model - it is not that useful for you. If there is model building - it is not unhelpful to have some estimate of how much it led to over-fitting.
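For reference, a rough sketch of the optimism bootstrap (Python with scikit-learn; the plain logistic fit stands in for whatever model-building process you actually use, and the AUC is only one possible performance measure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
n = 4_000
X = rng.normal(size=(n, 5))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 2.5))))

def build_model(X, y):
    # Stand-in for the full model-building process (splines, selection, ...).
    return LogisticRegression(max_iter=1000).fit(X, y)

# Apparent performance: the model evaluated on the data used to build it.
apparent = roc_auc_score(y, build_model(X, y).predict_proba(X)[:, 1])

# Optimism: repeat the *whole* building process on bootstrap resamples and
# compare performance on the resample vs. on the original data.
optimisms = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)
    m = build_model(X[idx], y[idx])
    perf_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
    perf_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
    optimisms.append(perf_boot - perf_orig)

print("apparent AUC          :", round(apparent, 3))
print("optimism-corrected AUC:", round(apparent - np.mean(optimisms), 3))
```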

charles
  • Please check Bruno's comment. You both suggest that splitting may not be required (you add: but still good to do so). Also, why is splitting the sample not a conventional way to approach this problem? Additionally, why can't I pick a simple 2-fold cross validation (i.e. simple split) vs 3-fold cross validation? – Maddy Apr 25 '14 at 03:43
  • (1) I upvoted Bruno's comments; they seemed sensible. (2) Estimates on a split sample are unstable unless the sample size is very large. So it is standard in fields that have very large data sets, but not in those that don't (n > 20k? what counts as large isn't clear). And if your sample is that large, it gets hard to over-fit a logistic regression model (apparent performance = oob performance). (3) For k-fold validation, 2-fold will give two estimates, so it's not quite a split sample. Choosing k has been discussed here (http://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation). – charles Apr 25 '14 at 12:38
0

Why not use cross validation, maybe with a larger number of folds, like 10-fold? LOOCV might also be interesting, but that could run really slowly.

You could alternatively do some kind of fancier custom CV where, in each round, you leave out one of the 420 positive events and the same proportion of the negative events (1/420 of them, to preserve the relative proportion). You would then have 420 CV iterations to calculate stats on, and you only give up training on a single positive sample in each round. That way you can get away with smaller training/testing splits. You could modify that to have fewer CV iterations if 420 would be too slow, maybe leaving out 5 positives at a time, and 5/420 of the negatives.
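A rough sketch of that second scheme (Python with scikit-learn; the simulated data, classifier, and per-fold statistic are placeholders, and the toy sizes are scaled down from 420/45,580 so the example runs quickly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_pos, n_neg = 40, 4_000          # your real data: ~420 positives, ~45,580 negatives
X = np.vstack([rng.normal(0.5, 1, size=(n_pos, 3)),
               rng.normal(0.0, 1, size=(n_neg, 3))])
y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
rng.shuffle(neg_idx)
# Split the negatives into one chunk per held-out positive, so each fold
# leaves out 1 positive and ~1/n_pos of the negatives.
neg_chunks = np.array_split(neg_idx, len(pos_idx))

held_out_scores = []
for p, neg_chunk in zip(pos_idx, neg_chunks):
    test = np.r_[[p], neg_chunk]
    train = np.setdiff1d(np.arange(len(y)), test)
    model = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    # Collect whatever per-fold statistic you care about; here, the
    # predicted probability assigned to the single held-out positive.
    held_out_scores.append(model.predict_proba(X[[p]])[0, 1])

print("mean predicted prob for held-out positives:",
      float(np.mean(held_out_scores)))
```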

  • Thanks. I did think about 3-fold cross validation or just 2 samples. But don't you think the training set would be too small for a logistic regression model? The larger picture is: why do we divide the set? To create a model that can predict an event. My main focus is not to build a prediction model but just to show that some coefficients are in line with a hypothesis. Though at the same time I agree that using 2 samples - train and test - can prove the overall model performance, which makes the coefficients & hypothesis sound solid. – Maddy Apr 22 '14 at 22:51
  • The second method I proposed, leaving out a single positive and the same proportion of negatives would not be much of a reduction in your training set for each iteration. Train on 419 points and 419/420*large # of negatives, and test on the remaining 1 positive and 1/420* large # negatives, see how well you do when you repeat that for all of your positives. You can then see how much significance varies and whatnot. – John St. John Apr 23 '14 at 04:54