I'm designing a logistic regression model to predict hospital mortality.
The aim is to estimate an 'adjusted' odds ratio for the effect of a variable of interest on mortality.
Methods:
- Fit the model on a training dataset (75% of the total)
- Started with 19 candidate variables (the dataset has 1684 observations)
- Included all variables with p < 0.2 in univariate analysis
- Applied stepwise selection (stepAIC function in the MASS package in R)
- Tested for confounding using interaction terms for variables in later models
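For reference, the 75/25 split in the first step can be sketched as below. This is only an illustration in plain Python, not the R code I actually ran; it assumes 1684 is the size of the full dataset, and the names `N_TOTAL`, `train_idx` and `test_idx` are made up for the example. It does not cover the univariate screening or the stepwise selection.

```python
import random

N_TOTAL = 1684          # assumption: 1684 is the size of the full dataset
TRAIN_FRACTION = 0.75   # 75% training, 25% test, as described above

rng = random.Random(42)  # fixed seed so the split is reproducible
indices = list(range(N_TOTAL))
rng.shuffle(indices)     # random row order before cutting

n_train = round(TRAIN_FRACTION * N_TOTAL)
train_idx = indices[:n_train]   # row indices used to fit the model
test_idx = indices[n_train:]    # held-out row indices used only for evaluation

print(len(train_idx), len(test_idx))
```

Under that assumption the test cohort is about 421 patients, so each percentage point of sensitivity corresponds to very few deaths.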
When I run predictions on the test cohort (25%), I get the following performance metrics:
- Sensitivity 12%
- Specificity 95%
- Accuracy 78%
Looking at the confusion matrix, the model predicts the majority class (survival) for nearly every patient, which gives a high accuracy but a very poor model overall.
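As a sanity check on the three figures above, the mortality rate in the test cohort can be backed out from them, because accuracy is a prevalence-weighted average of sensitivity and specificity. The sketch below uses only the reported numbers; the variable names are mine, and the result is only as precise as the rounded percentages.

```python
# Reported test-set metrics (from the figures above).
sensitivity = 0.12
specificity = 0.95
accuracy = 0.78

# accuracy = sensitivity * p + specificity * (1 - p), solved for prevalence p.
prevalence = (specificity - accuracy) / (specificity - sensitivity)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = max(prevalence, 1 - prevalence)

print(f"implied mortality rate: {prevalence:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

This gives an implied mortality rate of roughly 20%, so a model that always predicts survival would already reach about 80% accuracy. The fitted model's 78% is therefore no better than that trivial baseline.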
How can I improve the model?
Possible solutions?
- Go back to the drawing board and find 'better' variables that may be predictive of mortality?
- Balance the data in the training data set via up/down sampling?
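To make the second option concrete, here is a minimal sketch of random up-sampling of the minority class. It is plain Python on a toy dataset rather than my real data or R code, and `upsample_minority`, the `died` label and the toy rows are all hypothetical names invented for the example. Only the training set would be resampled; the test cohort would stay untouched.

```python
import random

def upsample_minority(rows, label_key="died", seed=0):
    """Resample minority-class rows with replacement until both classes are equal in size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    if len(positives) < len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    if not minority:
        raise ValueError("one class is empty; nothing to resample from")
    # Draw extra minority rows (with replacement) to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced

# Toy training set: 20 deaths among 100 patients (illustrative numbers only).
train = [{"id": i, "died": 1 if i < 20 else 0} for i in range(100)]
balanced = upsample_minority(train)

n_died = sum(r["died"] for r in balanced)
print(len(balanced), n_died)
```

My worry with this approach is that fitting on an artificially balanced sample shifts the model intercept, so the predicted probabilities would no longer reflect the real mortality rate.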