Probabilities in case-controlled studies

Question

I have a nested case-control study that I have been using for analysis. At the end of my work I have derived a set of variables that I later use to classify new cases. One example of a simple classifier I am using is naive Bayes, which simply outputs a probability.

So here is my question:

Could I make my probabilities reflect the real world? In my specific example, the condition that I am testing for has a prevalence of 33% in my study, but it has a population prevalence of only 10%. Bayes factors have been suggested to me as a way to achieve this; however, I am a little unsure how to set up the problem.

As an example, I have seen a Bayes factor computed on the logit scale from the true vs. study prevalence of the outcome. The classifier, however, was a logistic regression, and in that case the Bayes factor was just added to the linear predictor. I think the example there was very specific, and perhaps an inappropriate method for the probabilities of a naive Bayes classifier. What I did instead was add the logit Bayes factor to the logged probabilities, but I am not convinced this is right either. I also think a simpler solution would be to use Bayes' theorem directly, but there I am not sure how to represent my study vs. population prevalences. The method below isn't quite right, but gets at what I want:

        p_final = classifier_posterior*(population_prev)/(study_prev)

I should add that I use the probabilities to establish a threshold for classification downstream.

score 2 · Accepted Answer · answered Jul 15 '11 at 13:43

Your proposal makes sense in this context. The Naive Bayes formulation (using the same language as Wikipedia) is:

$P(C|F_1,\ldots,F_n) \propto P(C) \prod_{i=1}^n P(F_i|C)$

The $P(F_i|C)$ terms are estimated from the data, but instead of estimating $P(C)$ from the data (the study prevalence), you plug in an external value (the population prevalence). This is your proposal above, applied to each class and then renormalized so that the class probabilities sum to 1.
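A minimal sketch of this prior swap, assuming the classifier's posterior was computed with the study prevalence as $P(C)$ and that the $P(F_i|C)$ terms carry over unchanged to the population. The function names are hypothetical. The first function rescales each class by the ratio of population prior to study prior and renormalizes; the second does the same thing on the logit scale, which is the offset that gets added to the linear predictor in the logistic regression case.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def adjust_posterior(p_study, study_prev, population_prev):
    """Re-weight a posterior from the study prior to the population prior.

    Assumes p_study = P(C=1 | features) was computed with P(C=1) = study_prev.
    """
    # Rescale the positive class by (population prior) / (study prior).
    pos = p_study * population_prev / study_prev
    # Rescale the negative class by the same ratio for its own priors.
    neg = (1.0 - p_study) * (1.0 - population_prev) / (1.0 - study_prev)
    # Renormalize so the two class probabilities sum to 1.
    return pos / (pos + neg)


def adjust_posterior_logit(p_study, study_prev, population_prev):
    """Same correction as an additive offset on the logit scale.

    Not defined for p_study equal to exactly 0 or 1.
    """
    z = logit(p_study) + logit(population_prev) - logit(study_prev)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Prevalences from the question: 33% in the study, 10% in the population.
    print(adjust_posterior(0.5, 1 / 3, 0.10))
    print(adjust_posterior_logit(0.5, 1 / 3, 0.10))
```

With the prevalences from the question, a study-scale posterior of 0.5 maps to roughly 0.18, so any classification threshold chosen on the study scale would need to be moved in the same way.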

Simon Byrne


I strongly agree with this response, but I also think there are some questions we cannot answer with case-control studies. So perhaps we can only use the case-control study to inform our analysis of a randomized study design, and we may not want to try to draw generalizations from a case-control study design. – user4673 Jan 17 '12 at 00:43
That's true: case-control studies are quite useful (they can be cheap and quite powerful), but they are still observational, and can be subject (and quite sensitive) to selection and temporal biases. Basically, as with any technique, you need to understand the shortcomings. – Simon Byrne Jan 24 '12 at 10:47

score 1 · Answer 2 · answered Jun 14 '11 at 22:27

After a few days, I decided it may be best to use an alternative method. What I did was sample the data so that it reflected the reported distributions in the population. I repeated this a number of times, each time randomly sampling in the appropriate proportions, and averaged the classifier's performance across the repetitions.

I continued to use the case-control design to find the features that I wanted; however, in the validation step and subsequent performance reporting I used the sampling method. This seemed to me a simpler and more straightforward alternative to using a Bayes factor.
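The resampling idea can be sketched roughly as follows. This is only an illustration under assumptions not stated above: it keeps all controls and subsamples cases without replacement to hit the target prevalence, it uses plain accuracy as the performance measure, and the function name and parameters are hypothetical.

```python
import numpy as np


def resampled_accuracy(y_true, y_pred, population_prev, n_repeats=200, seed=0):
    """Average accuracy over subsamples whose prevalence matches the population.

    Assumption: all controls are kept and cases are subsampled without
    replacement; other resampling schemes are equally possible.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    case_idx = np.flatnonzero(y_true == 1)
    control_idx = np.flatnonzero(y_true == 0)

    # Number of cases needed so that cases / (cases + controls) = population_prev.
    n_cases = int(round(population_prev * len(control_idx) / (1.0 - population_prev)))
    if n_cases < 1 or n_cases > len(case_idx):
        raise ValueError("cannot reach the target prevalence by subsampling cases")

    scores = []
    for _ in range(n_repeats):
        # Draw a fresh random subset of cases for each repetition.
        sampled_cases = rng.choice(case_idx, size=n_cases, replace=False)
        idx = np.concatenate([sampled_cases, control_idx])
        scores.append(np.mean(y_true[idx] == y_pred[idx]))

    # Average performance across the repetitions.
    return float(np.mean(scores))


if __name__ == "__main__":
    # Toy data: 30 cases and 90 controls; a classifier that always predicts 0.
    y_true = np.array([1] * 30 + [0] * 90)
    y_pred = np.zeros_like(y_true)
    print(resampled_accuracy(y_true, y_pred, population_prev=0.10))
```

Accuracy here is a stand-in; the same loop works for any metric computed from the subsampled labels and predictions.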
