We have a large, nationwide prevalence study of healthcare-associated infections (HAI).
We need to determine whether individual hospitals have more or fewer HAI than expected given their patients' characteristics.
We estimated the per-patient HAI risk using a conditional inference tree with cross-validated tree depth to avoid overfitting. Since HAI are rare (8% prevalence in our data), we also oversampled the cases when fitting the model.
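For concreteness, here is a minimal sketch of this kind of setup, not our exact code (column and file names are hypothetical; sklearn's CART tree stands in for a conditional inference tree, and imbalanced-learn handles the oversampling):

```python
# Minimal sketch, not our exact pipeline: a depth-tuned decision tree with
# oversampling of the rare HAI cases. Column/file names are hypothetical.
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # applies resampling inside each fold

patients = pd.read_csv("patients.csv")
X = patients.drop(columns=["hai", "hospital"])  # patient-level predictors only
y = patients["hai"]                             # 1 = HAI case (8% prevalence)

pipe = Pipeline([
    ("oversample", RandomOverSampler(random_state=0)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Cross-validate the maximum tree depth to limit overfitting.
search = GridSearchCV(
    pipe,
    param_grid={"tree__max_depth": list(range(2, 11))},
    cv=StratifiedKFold(5, shuffle=True, random_state=0),
    scoring="roc_auc",
)
search.fit(X, y)
patient_risk = search.predict_proba(X)[:, 1]  # per-patient predicted HAI risk
```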
The model seems good: 100% sensitivity, 90.5% specificity, 100% negative predictive value. Only the positive predictive value is low, at 10.5%: the model predicts far too many cases, probably because of the oversampling.
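For reference, this shift can be undone with a standard prior correction (e.g. Elkan 2001): oversampling raises the base rate the model is trained on, so rescaling the predicted odds by the ratio of true to training prevalence maps the probabilities back. A sketch, assuming for illustration that cases were oversampled to 50%:

```python
# Sketch of a standard prior correction for probabilities from a model trained
# on oversampled data: rescale the odds by (true prior odds)/(training prior
# odds). The 50% training prevalence below is an assumption for illustration.
import numpy as np

def correct_probs(p, train_prev, true_prev):
    """Map predicted probabilities back from the oversampled base rate."""
    odds = p / (1 - p)
    odds *= (true_prev / (1 - true_prev)) / (train_prev / (1 - train_prev))
    return odds / (1 + odds)

p_raw = np.array([0.60, 0.90])  # raw model outputs
print(correct_probs(p_raw, train_prev=0.50, true_prev=0.08))
# -> [0.115..., 0.439...]: far fewer patients cross a 0.5 decision threshold
```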
So when we compare the predicted cases with the observed cases at the hospital level, we almost always find an excess of predicted cases.
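For concreteness, a sketch of the hospital-level comparison as an observed/expected ratio (toy data; note that the expected count here sums per-patient risks instead of counting thresholded 0/1 predictions, which is what produces the systematic excess):

```python
# Sketch: observed vs expected HAI per hospital. Toy data for illustration;
# "expected" sums calibrated per-patient risks rather than 0/1 predictions.
import pandas as pd

patients = pd.DataFrame({
    "hospital": ["A", "A", "A", "B", "B"],
    "hai":      [1, 0, 0, 1, 1],                 # observed outcome
    "risk":     [0.30, 0.10, 0.05, 0.40, 0.35],  # calibrated predicted risk
})

by_hosp = patients.groupby("hospital").agg(
    observed=("hai", "sum"),
    expected=("risk", "sum"),
)
by_hosp["O_E"] = by_hosp["observed"] / by_hosp["expected"]  # >1: more than expected
print(by_hosp)
```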
The question is: is this a good way to estimate the patient-based risk separately from the hospital effect? Is there a better data-driven way to separate patient-related and hospital-related risk?
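One candidate we are considering is a mixed-effects (random-intercept) logistic regression: patient covariates as fixed effects and hospital as a random intercept, so the fitted intercepts estimate each hospital's excess risk net of case mix. A sketch with statsmodels, with hypothetical covariate names:

```python
# Sketch of a random-intercept logistic regression: patient covariates are
# fixed effects and each hospital gets a random intercept, so the intercepts
# capture hospital risk net of case mix. Covariate names are hypothetical;
# statsmodels fits this model by variational Bayes.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

patients = pd.read_csv("patients.csv")  # hypothetical file

model = BinomialBayesMixedGLM.from_formula(
    "hai ~ age + surgery + device_days",  # patient-level fixed effects
    {"hospital": "0 + C(hospital)"},      # one random intercept per hospital
    patients,
)
fit = model.fit_vb()
print(fit.summary())
print(fit.random_effects())  # posterior mean/SD of each hospital's intercept
```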
Thanks
UPDATE: Out of curiosity, I also tried a random forest model, which gives almost 100% accuracy even under cross-validation. Is this a case of overfitting that is resistant to cross-validation, or should I conclude that the patient data alone contain all the information needed to predict HAI, with no contribution from the hospital to the risk?
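One thing I want to rule out first: if the oversampling happened before the cross-validation split, duplicated cases sit in both the training and test folds, a random forest simply memorizes them, and cross-validation cannot detect the overfitting. A sketch of the leaky vs. clean setup on synthetic data with the same 8% prevalence:

```python
# Sketch (synthetic data, ~8% positives): oversampling BEFORE cross-validation
# leaks duplicated cases into the test folds, so a random forest that memorizes
# them scores near 100%. Resampling inside each training fold removes the leak.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=5000, weights=[0.92], random_state=0)

# Leaky: resample first, then cross-validate.
X_os, y_os = RandomOverSampler(random_state=0).fit_resample(X, y)
rf = RandomForestClassifier(random_state=0)
print("leaky CV accuracy:", cross_val_score(rf, X_os, y_os, cv=5).mean())

# Clean: resample within each training fold only.
pipe = Pipeline([("os", RandomOverSampler(random_state=0)),
                 ("rf", RandomForestClassifier(random_state=0))])
print("clean CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```

If the leaky score is near perfect while the clean one is not, the random forest result says nothing about the hospital contribution; it is an artefact of the resampling.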