We have a large, nationwide prevalence study of healthcare-associated infections (HAI).
We need to determine whether individual hospitals have more or fewer HAI than expected given their patients' characteristics.
We estimated the per-patient HAI risk using a conditional inference tree with cross-validated tree depth to avoid overfitting. Since HAI are rare (8% prevalence in our data), we also oversampled the cases when fitting the model.
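For concreteness, here is a minimal sketch of this kind of setup, not our exact code (column and file names are hypothetical; sklearn's CART tree stands in for a conditional inference tree, and imbalanced-learn handles the oversampling):

```python
# Minimal sketch, not our exact pipeline: a depth-tuned decision tree with
# oversampling of the rare HAI cases. Column/file names are hypothetical.
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # applies resampling inside each fold

patients = pd.read_csv("patients.csv")
X = patients.drop(columns=["hai", "hospital"])  # patient-level predictors only
y = patients["hai"]                             # 1 = HAI case (8% prevalence)

pipe = Pipeline([
    ("oversample", RandomOverSampler(random_state=0)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Cross-validate the maximum tree depth to limit overfitting.
search = GridSearchCV(
    pipe,
    param_grid={"tree__max_depth": list(range(2, 11))},
    cv=StratifiedKFold(5, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X, y)
patient_risk = search.predict_proba(X)[:, 1]  # per-patient predicted HAI risk
```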
The model seems good: 100% sensitivity, 90.5% specificity, 100% negative predictive value. Only the positive predictive value is low, at 10.5%: the model predicts far too many cases, probably because of the oversampling.
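For reference, this shift can be undone with a standard prior correction (e.g. Elkan 2001): oversampling raises the base rate the model is trained on, so rescaling the predicted odds by the ratio of true to training prevalence maps the probabilities back. A sketch, assuming for illustration that cases were oversampled to 50%:

```python
# Sketch of a standard prior correction for probabilities from a model trained
# on oversampled data: rescale the odds by (true prior odds)/(training prior
# odds). The 50% training prevalence below is an assumption for illustration.
import numpy as np

def correct_probs(p, train_prev, true_prev):
    """Map predicted probabilities back from the oversampled base rate."""
    odds = p / (1 - p)
    odds *= (true_prev / (1 - true_prev)) / (train_prev / (1 - train_prev))
    return odds / (1 + odds)

p_raw = np.array([0.60, 0.90])  # raw model outputs
print(correct_probs(p_raw, train_prev=0.50, true_prev=0.08))
# -> [0.115..., 0.439...]: far fewer patients cross a 0.5 decision threshold
```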
So when we compare the predicted cases with the observed cases at the hospital level, we almost always find an excess of predicted cases.
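For concreteness, a sketch of the hospital-level comparison as an observed/expected ratio (toy data; note that the expected count here sums per-patient risks instead of counting thresholded 0/1 predictions, which is what produces the systematic excess):

```python
# Sketch: observed vs expected HAI per hospital. Toy data for illustration;
# "expected" sums calibrated per-patient risks rather than 0/1 predictions.
import pandas as pd

patients = pd.DataFrame({
    "hospital": ["A", "A", "A", "B", "B"],
    "hai":      [1, 0, 0, 1, 1],                 # observed outcome
    "risk":     [0.30, 0.10, 0.05, 0.40, 0.35],  # calibrated predicted risk
})

by_hosp = patients.groupby("hospital").agg(
    observed=("hai", "sum"),
    expected=("risk", "sum"),
)
by_hosp["O_E"] = by_hosp["observed"] / by_hosp["expected"]  # >1: more than expected
print(by_hosp)
```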
The question is: is this a good way to estimate the patient-based risk separately from the hospital effect? Is there a better data-driven way to separate patient-related and hospital-related risk?
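One candidate we are considering is a mixed-effects (random-intercept) logistic regression: patient covariates as fixed effects and hospital as a random intercept, so the fitted intercepts estimate each hospital's excess risk net of case mix. A sketch with statsmodels, with hypothetical covariate names:

```python
# Sketch of a random-intercept logistic regression: patient covariates are
# fixed effects and each hospital gets a random intercept, so the intercepts
# capture hospital risk net of case mix. Covariate names are hypothetical;
# statsmodels fits this model by variational Bayes.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

patients = pd.read_csv("patients.csv")  # hypothetical file

model = BinomialBayesMixedGLM.from_formula(
    "hai ~ age + surgery + device_days",  # patient-level fixed effects
    {"hospital": "0 + C(hospital)"},      # one random intercept per hospital
    patients,
)
fit = model.fit_vb()
print(fit.summary())
print(fit.random_effects())  # posterior mean/SD of each hospital's intercept
```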
Thanks
UPDATE: Out of curiosity, I also tried a random forest model, which gives almost 100% accuracy even under cross-validation. Is this a case of overfitting that is resistant to cross-validation, or should I conclude that the patient data alone contain all the information needed to predict HAI, with no contribution from the hospital to the risk?
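One thing I want to rule out first: if the oversampling happened before the cross-validation split, duplicated cases sit in both the training and test folds, a random forest simply memorizes them, and cross-validation cannot detect the overfitting. A sketch of the leaky vs. clean setup on synthetic data with the same 8% prevalence:

```python
# Sketch (synthetic data, ~8% positives): oversampling BEFORE cross-validation
# leaks duplicated cases into the test folds, so a random forest that memorizes
# them scores near 100%. Resampling inside each training fold removes the leak.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5000, weights=[0.92], random_state=0)

# Leaky: resample first, then cross-validate.
X_os, y_os = RandomOverSampler(random_state=0).fit_resample(X, y)
rf = RandomForestClassifier(random_state=0)
print("leaky CV accuracy:", cross_val_score(rf, X_os, y_os, cv=5).mean())

# Clean: resample within each training fold only.
pipe = Pipeline([("os", RandomOverSampler(random_state=0)),
                 ("rf", RandomForestClassifier(random_state=0))])
print("clean CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```

If the leaky score is near perfect while the clean one is not, the random forest result says nothing about the hospital contribution; it is an artefact of the resampling.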