
Suppose I have observational data and want to predict some disease based on patient visits. For each visit I know whether the disease occurred. The disease is preventable (which is why I want to predict it: so that it can be prevented). I want to use past visits to predict risk during future visits. There is no data on what happens between visits, and the outcome for the visit we want to predict is simply a binary variable indicating whether the disease was present. The disease can occur at any time during the visit.

However, consider a patient who leaves a visit at high risk for the disease but, between visits, decides to change their lifestyle and dramatically reduces that risk. At the next visit the patient does not get the disease. All the model sees, however, is the previous visit. Hence the model learns that the previous visit, which left the patient at high risk, is actually associated with a negative outcome!

Another scenario: say that the patient leaves a visit at high risk, but this time does nothing. At the next visit, the doctor sees the high risk and immediately prescribes a medication to prevent the disease. Now we again get a negative outcome for this very high-risk patient.

Ideally, to learn that this patient had a high-risk visit, we would have needed to see them get the disease.

Hence, there is a catch-22. If you can prevent a disease, you cannot develop a predictive model for it; if you cannot prevent it, why try to predict it?

This seems equivalent to being presented with a pristine dataset (where the responses really do reflect the risk) and then having some malevolent analyst secretly, and systematically, flip an unknown number of the labels to the opposite outcome. The resulting model ultimately predicts something (and it might do so well), but it is not predicting risk.
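To make the label-flipping analogy concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration, not taken from real data: two risk groups with untreated disease probabilities of 0.1 and 0.6, and an unrecorded intervention that reaches 90% of high-risk patients and fully prevents the disease. The function name `simulate` and its parameters are hypothetical.

```python
import random

def simulate(n=100_000, p_intervene=0.9, seed=0):
    """Return {group: (untreated_rate, observed_rate)}.

    Assumed setup (illustrative only): 'high' risk visits have an untreated
    disease probability of 0.6 and 'low' risk visits 0.1. An unrecorded
    intervention reaches a fraction p_intervene of high-risk patients and
    prevents the disease entirely.
    """
    rng = random.Random(seed)
    true_p = {"low": 0.1, "high": 0.6}
    # Per group: [number of visits, would-be cases, observed cases]
    counts = {g: [0, 0, 0] for g in true_p}
    for _ in range(n):
        group = "high" if rng.random() < 0.3 else "low"
        would_get_disease = rng.random() < true_p[group]
        # The intervention happens between or during visits and is not recorded.
        intervened = group == "high" and rng.random() < p_intervene
        observed = would_get_disease and not intervened
        c = counts[group]
        c[0] += 1
        c[1] += would_get_disease
        c[2] += observed
    return {g: (c[1] / c[0], c[2] / c[0]) for g, c in counts.items()}

rates = simulate()
for group, (untreated_rate, observed_rate) in rates.items():
    print(f"{group}: untreated risk {untreated_rate:.3f}, observed rate {observed_rate:.3f}")
```

Under these assumptions, the observed event rate in the high-risk group ends up below that of the low-risk group, so a model fit to the observed labels would rank the truly high-risk visits as the safer ones.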

sjw
  • Nice question. It is not so hopeless as it might seem. For one thing, the variable(s) used as an indicator that the disease is developing can itself serve as the dependent variable. Then, the resultant improved lifestyle, or the resultant medication that saves the day, will not interfere with the modeling, since they'll be after the fact. – rolando2 Jan 05 '18 at 17:27
  • If there are lifestyle choices and medical interventions known to prevent the disease, then these should be included as factors in the predictive model. I don't see the catch. – mzunhammer Jan 05 '18 at 17:32
  • @mzunhammer you are correct if the lifestyle choices and medical interventions are known, but that is not always the case. Data on lifestyle choices are especially difficult to acquire. Let's assume in this particular case that the data between visits is not available. For visits (when a provider could affect the outcome via intervention) we just have a binary response indicating presence of disease; we don't know when it happened. That is also common in, for example, electronic health record data, where diagnosis codes are usually assigned after a visit. – sjw Jan 05 '18 at 18:00
  • @rolando2 hence whatever the doctor sees, the model sees as well, before intervention can be made. This is an ideal approach (as also mentioned by @mzunhammer), although I do not often see data detailed enough to allow for it (e.g., this might require data between visits and the time that the disease occurs during a visit). I have slightly modified the question to include these "missing information" constraints. – sjw Jan 06 '18 at 13:38
