Just for fun, I'm trying to do some simple medical diagnosis using Bayes' theorem. Right now I'm calculating
P(condition | symptoms) ∝ P(symptoms | condition) * P(condition)
for each possible condition (dropping the normalizing constant P(symptoms), which is the same for every condition), then choosing the most likely condition given the present symptoms as the "diagnosis". For simplicity's sake I assume that the symptoms are independent given the condition, so P(symptoms | condition) is the product of the individual P(symptom | condition). This works well when I have a complete list of the probabilities P(symptom | condition) for all symptoms and conditions.
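The calculation described above could be sketched roughly as follows. The condition names, symptom names, priors, and probabilities are made-up placeholders, not real medical data, and `log_score` and `diagnose` are hypothetical helper names. Only the symptoms that are present contribute to the score, and logs are used so that a long product of small probabilities does not underflow.

```python
import math

# Hypothetical, made-up numbers for illustration only.
priors = {"A": 0.3, "B": 0.7}
likelihoods = {
    "A": {"fever": 0.8, "cough": 0.6, "rash": 0.1},
    "B": {"fever": 0.2, "cough": 0.7, "rash": 0.5},
}

def log_score(condition, symptoms):
    """log P(condition) + sum of log P(symptom | condition) over present symptoms."""
    total = math.log(priors[condition])
    for s in symptoms:
        total += math.log(likelihoods[condition][s])
    return total

def diagnose(symptoms):
    """Return the most likely condition and the normalized posterior."""
    scores = {c: log_score(c, symptoms) for c in priors}
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    unnormalized = {c: math.exp(v - top) for c, v in scores.items()}
    z = sum(unnormalized.values())
    posterior = {c: v / z for c, v in unnormalized.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior

best, posterior = diagnose(["fever", "cough"])
print(best, posterior)
```

With these placeholder numbers the unnormalized scores are 0.3 * 0.8 * 0.6 = 0.144 for A and 0.7 * 0.2 * 0.7 = 0.098 for B, so A is chosen.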
However, I want to do better in the case where I do not know how likely each symptom is to occur as part of every disease. Let's say, for example, that I have a "patient" with a long list of symptoms, and two possible conditions A and B. For condition A, I have a full list of the symptoms and their probabilities, while for condition B I only know the five most common symptoms. To calculate P(condition B | symptoms), my current solution is to set P(symptom | condition B) to some base rate, e.g. 0.01, both when I know for sure that the symptom is never caused by condition B and when I don't know the real rate of the symptom under condition B.
This leads to problems: when more symptom probabilities are known for condition A than for condition B, condition A will often end up as the "diagnosis" even if every P(symptom | condition A) is low, since each symptom whose probability is unknown under condition B is penalized with the even lower base rate.
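To make the problem concrete, here is a small sketch of the base-rate fallback with made-up numbers. Everything in it is a hypothetical placeholder: the symptom names `s1` to `s6`, the equal priors, the uniform 0.05 probabilities for condition A, and the two known symptoms for condition B (fewer than the five mentioned above, just to keep the example short).

```python
import math

BASE_RATE = 0.01  # fallback from the text, used for "never" and "unknown" alike

# Hypothetical, made-up numbers. Condition A has a full symptom list with
# uniformly low probabilities; condition B only has two known symptoms.
priors = {"A": 0.5, "B": 0.5}
known = {
    "A": {"s1": 0.05, "s2": 0.05, "s3": 0.05, "s4": 0.05, "s5": 0.05, "s6": 0.05},
    "B": {"s1": 0.9, "s2": 0.9},
}

def log_score(condition, symptoms):
    """Return (log score, number of symptoms that fell back to BASE_RATE)."""
    total = math.log(priors[condition])
    fallbacks = 0
    for s in symptoms:
        p = known[condition].get(s)
        if p is None:
            # Probability not listed for this condition: use the base rate.
            p = BASE_RATE
            fallbacks += 1
        total += math.log(p)
    return total, fallbacks

patient = ["s1", "s2", "s3", "s4", "s5", "s6"]
results = {c: log_score(c, patient) for c in priors}
diagnosis = max(results, key=lambda c: results[c][0])
print(diagnosis, results)
```

Here condition A scores 0.5 * 0.05^6 ≈ 7.8e-9, while condition B scores 0.5 * 0.9^2 * 0.01^4 ≈ 4.1e-9, so A is picked even though B explains the two symptoms it has data for far better. The four fallbacks to 0.01 outweigh everything else.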
What is the best way to properly handle this uncertainty and solve the problem presented above?