Apologies if this question doesn't make much sense and is overly long. I had some basic stats training at university, but there are lots of gaps in my knowledge that I've gradually been trying to fill, and I've got into a bit of a muddle over model selection: specifically, how to choose between a series of logistic regression models, and whether cross-validation is necessary for logistic regression models that won't be used for prediction.
The models are aimed at inferring which variables are most strongly related to the outcomes of complaints against practitioners in a profession over a three-year period.
- Ideally, I want to choose a relatively parsimonious model that best fits the data during the time period.
- The model will not be used to make predictions on future outcomes, but to make inferences about decisions during the time period.
I've been choosing a model by taking into account AIC and BIC scores. I've read that judging models by AIC is best when those models will be used to make predictions, and that BIC is better for choosing a 'truer' model for the data used to fit it. I thought I had developed a relatively good model based on this approach, but now I'm not so sure.
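In case it's useful, this is roughly how I've been making the comparison, sketched in Python with statsmodels on made-up stand-in data (the variable names and formulas here are just placeholders, not my real data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Stand-in data: binary outcome 'upheld' and a handful of dummy-coded
# predictors x1..x4 (my real variables have different names and meanings).
n = 1000
df = pd.DataFrame({f"x{i}": rng.integers(0, 2, size=n) for i in range(1, 5)})
df["upheld"] = rng.integers(0, 2, size=n)

# A few candidate models; in practice my candidate set is much larger.
candidate_formulas = [
    "upheld ~ x1 + x2",
    "upheld ~ x1 + x2 + x3",
    "upheld ~ x1 + x2 + x3 + x4",
]

# Fit each logistic regression and record its AIC and BIC.
results = []
for formula in candidate_formulas:
    fit = smf.logit(formula, data=df).fit(disp=0)
    results.append((formula, fit.aic, fit.bic))

# List the candidates from lowest to highest BIC.
for formula, aic, bic in sorted(results, key=lambda r: r[2]):
    print(f"{formula}: AIC={aic:.1f}, BIC={bic:.1f}")
```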
My supervisor ran another model using stepwise selection, as that was how it had been done in previous research. The model it produced has a relatively low BIC score compared to the ones I've developed, and a relatively high AIC score, but it is quite different in terms of a few of the independent variables it contains. I've read that stepwise selection is not a good way to select a model and would rather not go with this approach if possible. But now I feel I need to further justify the models I produced, and I'm not entirely sure how to do this.
Also, given how different the less significant independent variables (those with higher p-values) are between my supervisor's stepwise model and the one I came up with, I suspect there may be a multitude of models that would all give relatively low BIC scores, and I'm not sure how to choose between them.
My supervisor has also produced some pseudo R-squared measures for his model (Cox-Snell and Nagelkerke), and I have compared these to the ones for my models, but there's no clear conclusion: the scores are all quite similar. I'm not entirely sure whether comparing models based on pseudo R-squared measures is a good idea, and from what I've read, it doesn't seem like it is.
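As I understand it, both measures are computed from the fitted and intercept-only log-likelihoods. Here is the calculation as I've been doing it, again as a sketch on stand-in data and assuming a statsmodels Logit fit:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def cox_snell_nagelkerke(fit):
    """Cox-Snell and Nagelkerke pseudo R-squared from a fitted statsmodels
    Logit result, using the fitted (llf) and intercept-only (llnull)
    log-likelihoods."""
    n = fit.nobs
    # Cox-Snell: 1 - (L0 / L1)^(2/n), written in terms of log-likelihoods.
    r2_cs = 1.0 - np.exp((2.0 / n) * (fit.llnull - fit.llf))
    # Nagelkerke rescales Cox-Snell by its maximum attainable value.
    r2_nagel = r2_cs / (1.0 - np.exp((2.0 / n) * fit.llnull))
    return r2_cs, r2_nagel

# Stand-in data, just to show the calculation end to end.
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.integers(0, 2, 1000), "x2": rng.integers(0, 2, 1000)})
df["upheld"] = rng.integers(0, 2, 1000)

fit = smf.logit("upheld ~ x1 + x2", data=df).fit(disp=0)
print(cox_snell_nagelkerke(fit))
```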
I have a bit of knowledge of how cross-validation works and how it is used to validate predictive models by holding out test sets from training sets. I'm not altogether sure, however, whether this process is needed for non-predictive models, or whether it would be useful for choosing between the models I've developed if they aren't to be used for prediction.
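For reference, this is the kind of cross-validation I have in mind: a scikit-learn sketch on randomly generated stand-in data shaped roughly like mine, scoring each held-out fold by log loss.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in data roughly shaped like mine: ~1000 cases, dummy-coded
# predictors, binary outcome (the real data obviously isn't random noise).
X = rng.integers(0, 2, size=(1000, 12))
y = rng.integers(0, 2, size=1000)

# Unpenalised logistic regression (penalty=None needs scikit-learn >= 1.2;
# older versions spell it penalty="none").
model = LogisticRegression(penalty=None, max_iter=1000)

# 10-fold CV scores the model on held-out folds -- this is the predictive
# check I'm unsure is even relevant when the model is only used for inference.
scores = cross_val_score(model, X, y, cv=10, scoring="neg_log_loss")
print("mean held-out log loss:", -scores.mean())
```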
So, basically, I have a few questions:
- Is cross-validation useful in choosing between inferential models that will not be used to make predictions?
- If the BIC score of a model produced by stepwise regression is relatively low in comparison to other models, would it be a useful model despite all of the drawbacks of stepwise selection?
- Is comparing models from the same dataset by using pseudo R-squared scores a bad idea?
- What approach would you take to choose between competing regression models that will not be used for prediction?
Hope this makes sense. Any help would be greatly appreciated.
Edit: To elaborate, there are just over 1000 cases and around 50 dummy variables to choose from; most of these variables appear to be insignificant. There are two variables that appear in every model, and then around 10-15 others that may help improve the model. The model with the lowest AIC has 11 independent variables, and the one with the lowest BIC has 9. All of the variables are categorical.
I was hoping to go with a more parsimonious model, as I thought that when making inferences rather than predictions, a simpler model gives a more stable representation of the data-generating process and is less bloated when explaining it to non-researchers, but I'm not sure whether this is right. I'm also not sure how to specify exactly how parsimonious the model should ideally be. I think I've previously been taught to rely on p-values too heavily, and that I shouldn't think so much about the statistical significance of individual variables. Ideally, I would like the coefficients of the variables to stay relatively stable across models while still improving on model fit.
I think I'm gradually discovering how much uncertainty is involved in all of this, and that it is as much an art as a science.