The data include 3 equally sized subsets A, B and C, belonging to two classes:
- A belongs to class 1.
- B and C belong to class 2.
The prior probabilities of an observation coming from class 1 and class 2 are thus 0.33 and 0.67.
Next, a logistic regression model is fitted on all 3 subsets.
The predicted value of this model is the probability of an observation belonging to class 2 given its predictor values.
In practice, I know for certain that I will never see observations from subset C. New observations will therefore always originate from either subset A or subset B, and since both subsets are equally sized, I can assume that the prior probability of a new observation belonging to class 1 or class 2 changes to 0.5 each.
My questions are:
- Given that all new observations come from either A or B but never C, can the predicted values of the logistic regression model fitted on all 3 subsets still be interpreted as the probability of being in class 2?
- Are these probabilities biased because of the changed prior probability of being in class 1 and 2?
- If so, how can I correct for this?
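
To make the last question concrete, here is a minimal sketch of the one adjustment I am aware of: rescaling the predicted odds by the ratio of the new prior odds to the training prior odds. The function name `adjust_for_prior` and its parameters are hypothetical, made up for illustration. The sketch assumes that only the class priors change and that the distribution of the predictors within each class stays the same, which may not hold here, because dropping C also changes what class 2 consists of.

```python
import math

def adjust_for_prior(p_train, train_prior, new_prior):
    """Rescale a predicted class-2 probability to a new class-2 prior.

    Hypothetical helper for illustration. Assumes 0 < p_train < 1 and that
    the within-class predictor distributions are unchanged.
    """
    # Prior odds of class 2 in the training data and in the new setting.
    train_odds = train_prior / (1.0 - train_prior)
    new_odds = new_prior / (1.0 - new_prior)
    # Posterior odds scale by the ratio of new to old prior odds.
    odds = p_train / (1.0 - p_train) * (new_odds / train_odds)
    return odds / (1.0 + odds)

# Training prior for class 2 is 2/3 (subsets B and C); assumed new prior is 0.5.
adjusted = adjust_for_prior(0.8, train_prior=2 / 3, new_prior=0.5)
print(round(adjusted, 3))
```

On the log-odds scale this is the same as adding the constant log(new prior odds) - log(training prior odds) to the linear predictor, so only the intercept would change. Whether that is enough in my situation is part of what I am asking.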