Apologies if this question doesn't make much sense and is overly long. I had some basic stats training at university, but there are lots of gaps in my knowledge that I've gradually been trying to fill, and I've got into a bit of a muddle over model selection: specifically, how to choose between a series of logistic regression models, and whether cross-validation is necessary for logistic regression models that won't be used for prediction.
The models are aimed at inferring which variables are most strongly related to the outcomes of complaints against practitioners in a profession over a three-year period.
- Ideally, I want to choose a relatively parsimonious model that best fits the data during the time period.
- The model will not be used to make predictions on future outcomes, but to make inferences about decisions during the time period.
I've been choosing a model by taking into account AIC and BIC scores. I've read that judging models by AIC is best when those models will be used to make predictions, and that BIC is better for choosing a 'truer' model for the data used to fit it. I thought I had developed a relatively good model based on this approach, but now I'm not so sure.
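In case it's useful, this is roughly how I've been making the comparison, sketched in Python with statsmodels on made-up stand-in data (the variable names and formulas here are just placeholders, not my real data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Stand-in data: binary outcome 'upheld' and a handful of dummy-coded
# predictors x1..x4 (my real variables have different names and meanings).
n = 1000
df = pd.DataFrame({f"x{i}": rng.integers(0, 2, size=n) for i in range(1, 5)})
df["upheld"] = rng.integers(0, 2, size=n)

# A few candidate models; in practice my candidate set is much larger.
candidate_formulas = [
    "upheld ~ x1 + x2",
    "upheld ~ x1 + x2 + x3",
    "upheld ~ x1 + x2 + x3 + x4",
]

# Fit each logistic regression and record its AIC and BIC.
results = []
for formula in candidate_formulas:
    fit = smf.logit(formula, data=df).fit(disp=0)
    results.append((formula, fit.aic, fit.bic))

# List the candidates from lowest to highest BIC.
for formula, aic, bic in sorted(results, key=lambda r: r[2]):
    print(f"{formula}: AIC={aic:.1f}, BIC={bic:.1f}")
```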
My supervisor ran another model using stepwise selection, as that was how it had been done in previous research. The model it produced has a relatively low BIC score compared to the ones I've developed, and a relatively high AIC score, but it is quite different in terms of a few of the independent variables it contains. I've read that stepwise selection is not a good way to select a model and would rather not go with this approach if possible. But now I feel I need to further justify the models I produced, and I'm not entirely sure how to do this.
Also, given how different the less significant independent variables (those with higher p-values) are between my supervisor's stepwise model and the one I came up with, I suspect there may be a multitude of models that would all give relatively low BIC scores, and I'm not sure how to choose between them.
My supervisor has also produced some pseudo R-squared measures for his model (Cox-Snell and Nagelkerke), and I have compared these to the ones for my models, but there's no clear conclusion: the scores are all quite similar. I'm not entirely sure whether comparing models based on pseudo R-squared measures is a good idea, and from what I've read, it doesn't seem like it is.
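As I understand it, both measures are computed from the fitted and intercept-only log-likelihoods. Here is the calculation as I've been doing it, again as a sketch on stand-in data and assuming a statsmodels Logit fit:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def cox_snell_nagelkerke(fit):
    """Cox-Snell and Nagelkerke pseudo R-squared from a fitted statsmodels
    Logit result, using the fitted (llf) and intercept-only (llnull)
    log-likelihoods."""
    n = fit.nobs
    # Cox-Snell: 1 - (L0 / L1)^(2/n), written in terms of log-likelihoods.
    r2_cs = 1.0 - np.exp((2.0 / n) * (fit.llnull - fit.llf))
    # Nagelkerke rescales Cox-Snell by its maximum attainable value.
    r2_nagel = r2_cs / (1.0 - np.exp((2.0 / n) * fit.llnull))
    return r2_cs, r2_nagel

# Stand-in data, just to show the calculation end to end.
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.integers(0, 2, 1000), "x2": rng.integers(0, 2, 1000)})
df["upheld"] = rng.integers(0, 2, 1000)

fit = smf.logit("upheld ~ x1 + x2", data=df).fit(disp=0)
print(cox_snell_nagelkerke(fit))
```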
I have a bit of knowledge of how cross-validation works and how it is used to validate predictive models by holding out test sets from training sets. I'm not altogether sure, however, whether this process is needed for non-predictive models, or whether it would be useful for choosing between the models I've developed if they aren't to be used for prediction.
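For reference, this is the kind of cross-validation I have in mind: a scikit-learn sketch on randomly generated stand-in data shaped roughly like mine, scoring each held-out fold by log loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in data roughly shaped like mine: ~1000 cases, dummy-coded
# predictors, binary outcome (the real data obviously isn't random noise).
X = rng.integers(0, 2, size=(1000, 12))
y = rng.integers(0, 2, size=1000)

# Unpenalised logistic regression (penalty=None needs scikit-learn >= 1.2;
# older versions spell it penalty="none").
model = LogisticRegression(penalty=None, max_iter=1000)

# 10-fold CV scores the model on held-out folds -- this is the predictive
# check I'm unsure is even relevant when the model is only used for inference.
scores = cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
print("mean held-out log loss:", -scores.mean())
```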
So, basically, I have a few questions:
- Is cross-validation useful in choosing between inferential models that will not be used to make predictions?
- If the BIC score of a model produced by stepwise regression is relatively low in comparison to other models, would it be a useful model despite all of the drawbacks of stepwise selection?
- Is comparing models from the same dataset by using pseudo R-squared scores a bad idea?
- What approach would you take to choose between competing regression models that will not be used for prediction?
Hope this makes sense. Any help would be greatly appreciated.
Edit: To elaborate, there are just over 1000 cases and around 50 dummy variables to choose from; most of these variables appear to be insignificant. There are two variables that appear in every model, and then around 10-15 others that may help improve the model. The model with the lowest AIC has 11 independent variables, and the one with the lowest BIC has 9. All of the variables are categorical.
I was hoping to go with a more parsimonious model, as I thought that when making inferences rather than predictions, a simpler model gives a more stable representation of the data-generating process and is less bloated when explaining it to non-researchers, but I'm not sure whether this is right. I'm also not sure how to specify exactly how parsimonious the model should ideally be. I think I've previously been taught to rely on p-values too heavily, and that I shouldn't think so much about the statistical significance of individual variables. Ideally, I would like the coefficients of the variables to stay relatively stable across models while still improving on model fit.
I think I'm gradually discovering how much uncertainty is involved in all of this, and that it is as much an art as a science.