> Goal is to find the top-predictors related to hospitalisation (this could be indicative of disease deterioration) --> These findings may guide new research on these predictors and are not directly intended as prognostic tool.
First, one hopes that the "top-predictors related to hospitalisation" won't differ much among these models. There's no need to choose just one model to reach that goal. Showing results of several models all pointing to the same predictors will strengthen your case.
If the models do differ substantially among top predictors, then there is a potentially serious problem. You would need to examine reasons why, and decide which of the models might best "guide new research on these predictors" based more on your understanding of the clinical situation.
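One way to make that comparison concrete is to put all the model types on a common importance scale, for example with permutation importance. Here is a minimal sketch, not your pipeline: the synthetic data, the model list, and the choice of scorer are all placeholders for your own.

```python
# Sketch (placeholder data and models): do different model types broadly
# agree on the top predictors? Permutation importance puts them on a
# common scale so the rankings can be compared directly.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, weights=[0.7, 0.3], random_state=0)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])

models = {
    "logistic_l1": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
}

rankings = {}
for name, model in models.items():
    model.fit(X, y)
    imp = permutation_importance(model, X, y, scoring="neg_log_loss",
                                 n_repeats=20, random_state=0)
    rankings[name] = pd.Series(imp.importances_mean, index=X.columns)

# Rank predictors within each model; large disagreements near the top
# are the warning sign discussed above.
print(pd.DataFrame(rankings).rank(ascending=False).sort_values("random_forest"))
```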
The exception to the above would be the LASSO models. Predictors in clinical studies tend to be highly inter-correlated, and LASSO's choice of which of a set of correlated predictors to retain can be very sample dependent.
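You can see that sample dependence directly by refitting the LASSO on bootstrap resamples and counting how often each predictor is retained. A rough sketch, with placeholder data (the `n_redundant` features stand in for correlated clinical predictors) and an arbitrary penalty strength:

```python
# Sketch of LASSO's sample dependence with correlated predictors: refit an
# L1-penalized logistic model on bootstrap resamples and count how often
# each predictor is retained. Data and penalty C are placeholders.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           n_redundant=5, random_state=0)  # redundant = correlated predictors
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])

selection_counts = pd.Series(0, index=X.columns)
n_boot = 200
for b in range(n_boot):
    Xb, yb = resample(X, y, random_state=b)   # one bootstrap sample
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(Xb, yb)
    selection_counts += (np.abs(lasso.coef_.ravel()) > 1e-8)

# Predictors kept in only a modest fraction of resamples are being chosen
# largely at the whim of the particular sample.
print((selection_counts / n_boot).sort_values(ascending=False))
```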
Second, with that goal in mind, the particular confusion matrix that appears to be best becomes less of an issue. The critical thing is to have well calibrated models. Calibration represents the quality of the model over the entire probability range, assessed by measures like log-loss or the Brier score. Then focus further work on the predictors that are both important in calibrated models and might be amenable to interventions to reduce avoidable hospitalizations.
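For example, a calibration check is only a few lines with scikit-learn; the synthetic data and random-forest model below are stand-ins for your own.

```python
# Sketch of checking calibration rather than a single confusion matrix:
# Brier score, log-loss, and a reliability (calibration) table.
import pandas as pd
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]

print("Brier score:", brier_score_loss(y_te, p))
print("Log-loss:   ", log_loss(y_te, p))

# Observed event fraction vs. mean predicted probability in bins; a well
# calibrated model tracks the diagonal over the whole probability range.
obs, pred = calibration_curve(y_te, p, n_bins=10)
print(pd.DataFrame({"mean_predicted": pred, "observed_fraction": obs}))
```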
Third, the differences in confusion matrices might be overstated in your examples. The XGBoost matrix is evidently based on a probability cutoff of 0.5, while a different cutoff might perform better. Also, and maybe more important, the matrices are based upon a single particular set-aside test set. There's a danger that a model appearing to be better with that test set might be worse with a different training/test split choice, and thus might not be the model that best represents the characteristics of the underlying population.
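It's easy to check how fragile a single confusion matrix is. A small sketch (placeholder data, and an sklearn gradient-boosting model standing in for XGBoost) that varies both the cutoff and the train/test split:

```python
# Sketch of how much a confusion matrix depends on the probability cutoff
# and on the particular train/test split. Data and model are stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=0)

for split_seed in range(3):                      # three different set-aside test sets
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=split_seed)
    model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    for cutoff in (0.3, 0.5, 0.7):               # the 0.5 default is not sacred
        print(f"split {split_seed}, cutoff {cutoff}:")
        print(confusion_matrix(y_te, (p >= cutoff).astype(int)))
```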
You can clarify these issues by resampling the data to validate the model-building process. You are already using resampling in cross-validation (within your training set) to choose hyperparameter values. Try using bootstrap resampling at an even higher level.
With data sets of less than several thousand cases, it's best to use the entire data set to build the model and then to test modeling performance via bootstrapping. Try the following for each of your model types and scoring functions, which shouldn't be too onerous with your automated approach and only a few hundred cases.
After building a model with the full data set,* take repeated bootstrap samples from your data. Perform your entire modeling process (including hyperparameter optimization) on each bootstrap sample as a training set, then use the full data sample as the test set. As bootstrapping mimics the process of taking your data sample from the underlying population, this process provides an estimate of how well your model-building process would work on multiple samples from that population.
Examine the performance of each of your model types and scoring functions over many bootstrap samples. Look at average performance over the bootstrap-based models as a measure of overall quality. Look at differences between training-set and test-set predictions to evaluate modeling bias. For models that do predictor selection or rank predictor importance, see how well those selections or rankings are maintained among the bootstrap-based models.
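Here is a minimal sketch of that outer bootstrap loop, assuming a scikit-learn-style pipeline. The synthetic data, the random-forest model, the grid, the Brier-score criterion, and the helper name `fit_whole_process` are all placeholders; the point is only that the *entire* process, tuning included, is repeated on each bootstrap sample and then scored on the full data.

```python
# Sketch of the bootstrap validation described above: repeat the whole
# model-building process on each bootstrap sample (as training set), then
# score each resulting model on the full data set (as test set).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import GridSearchCV
from sklearn.utils import resample

X, y = make_classification(n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=0)

def fit_whole_process(X_train, y_train):
    """The *entire* model-building process, including hyperparameter tuning."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid={"max_depth": [3, 5, None],
                                      "min_samples_leaf": [1, 5, 10]},
                          scoring="neg_brier_score", cv=5)
    search.fit(X_train, y_train)
    return search.best_estimator_

full_model = fit_whole_process(X, y)               # the model you actually report

train_scores, test_scores = [], []
n_boot = 50                                        # use a few hundred in practice
for b in range(n_boot):
    idx = resample(np.arange(len(y)), random_state=b)   # bootstrap row indices
    model = fit_whole_process(X[idx], y[idx])
    train_scores.append(brier_score_loss(y[idx], model.predict_proba(X[idx])[:, 1]))
    test_scores.append(brier_score_loss(y, model.predict_proba(X)[:, 1]))

print("mean Brier on bootstrap training sets:", np.mean(train_scores))
print("mean Brier on full data (test):       ", np.mean(test_scores))
print("estimated optimism:                   ", np.mean(test_scores) - np.mean(train_scores))
```

The same loop can also collect each bootstrap model's selected predictors or importance rankings, to see how stable they are across resamples.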
It's possible, after you have gone through that process, that the random forest model based on the F1_weighted score will still perform best by your criteria. But then you will have documented its superiority in a robust way that should help allay any theoretical concerns.** As your goal is to identify important predictors rather than to develop a prognostic tool, you might worry less about how you got to a model and worry more about how well the model is pointing you to critical predictors.
I'd recommend looking carefully at Frank Harrell's course notes for guidance. Although written from the perspective of regression models, the principles apply very generally. Chapter 5 is particularly relevant to assessment and validation of models, and Section 4.10 touches on ways to compare models.
*Two thoughts on your modeling per se. For one, your class imbalance isn't that bad. Your attempts to adjust for class imbalance might be making problems worse with some types of models. For another, make sure that this event-probability/classification model is appropriate for your data. If you are examining something like 60-day hospital readmissions, when you have complete data on all cases for the entire time span, that's OK. If you're looking at a longer time span over which all you can say is that some people haven't been hospitalized yet, you should be taking that into account with a time-based survival model that handles that "right-censoring" of times to events. There are random-forest, gradient-boosting, and several standard survival-regression implementations available for survival models, at least in R.
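If your stack is Python rather than R, the scikit-survival package provides comparable models; here is a rough sketch with made-up follow-up times and censoring, just to show the shape of a right-censored analysis (the package choice and all the data are assumptions, not part of your setup).

```python
# Sketch of a right-censored alternative using the Python scikit-survival
# package, standing in for the R implementations mentioned above.
# Follow-up times, censoring, and features are synthetic placeholders.
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
time_to_event = rng.exponential(scale=180, size=300)   # days to hospitalisation
follow_up = rng.uniform(30, 365, size=300)             # administrative censoring
event = time_to_event <= follow_up                     # True only if hospitalisation observed
observed_time = np.minimum(time_to_event, follow_up)

y = Surv.from_arrays(event=event, time=observed_time)  # structured (event, time) outcome
model = RandomSurvivalForest(n_estimators=500, min_samples_leaf=10, random_state=0)
model.fit(X, y)

# score() returns Harrell's concordance index rather than a classification metric
print("concordance index:", model.score(X, y))
```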
**In terms of those theoretical arguments, even though F1_weighted isn't a proper scoring rule, it overcomes one limit of the standard F1 score (ignoring true negatives) by calculating separate F1 scores with each class treated as the positive class, then taking a weighted average. Log-loss puts a lot of weight on the extremes of the probability scale and might not always be the best choice among proper scoring rules. You could, for example, use the Brier score instead of log-loss as the criterion for model building to minimize problems at the probability extremes.