I need to produce predictions for a binary state at the individual level. The response variable is imbalanced, about 99:1, with the positive class being the minority.
Each row in my dataset represents an individual, and my predictors are predominantly categorical, based on that individual's characteristics. I have just one continuous predictor, the individual's age. I have built a logistic regression model to provide probabilities of a positive response, and the model diagnostics all look good.
I am now at the stage of using the model to make predictions about new individuals, using some test data with known responses. My question concerns the best way to apply a decision to the predicted probabilities, in order to make realistic classifications at an individual level.
If I apply a decision threshold to the predicted probabilities, I get roughly the expected proportion of positive cases, but the distribution of characteristics among them does not look realistic. For example, one categorical predictor is "US_State", which has two levels, "NY" and "NJ", in a ratio of 95:5. Among the individuals who are actually positive, 68% are from NY and 32% from NJ. However, among the positives predicted by applying a decision threshold, only 17% are from NY and 83% are from NJ.
Because prevalence is much higher in NJ, the predicted probabilities are higher, and my decision threshold results in a very high proportion of NJ individuals with a positive response.
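To illustrate the effect, here is a toy Python sketch (not my actual model or data). The per-state rates are derived from the figures above; the single standard-normal `other_effects` term is a made-up stand-in for all my other predictors, and every variable name is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# State mix of 95:5 NY:NJ, as in the question.
is_nj = rng.random(n) < 0.05

# Baseline rates per state implied by the figures above:
# 1% overall prevalence, with positives split 68% NY / 32% NJ.
rate_ny = 0.01 * 0.68 / 0.95   # about 0.7%
rate_nj = 0.01 * 0.32 / 0.05   # 6.4%

def logit(p):
    return np.log(p / (1 - p))

# ASSUMPTION: one standard-normal term on the log-odds scale stands in
# for the combined effect of every other predictor.
other_effects = rng.normal(0.0, 1.0, n)
log_odds = np.where(is_nj, logit(rate_nj), logit(rate_ny)) + other_effects
prob = 1 / (1 + np.exp(-log_odds))

# Composition that the probabilities themselves imply for the positives.
nj_share_expected = prob[is_nj].sum() / prob.sum()

# Pick the threshold that yields roughly the expected number of positives.
threshold = np.quantile(prob, 1 - prob.mean())
predicted_pos = prob >= threshold
nj_share_thresholded = is_nj[predicted_pos].mean()

print(f"NJ share implied by probabilities: {nj_share_expected:.2f}")
print(f"NJ share after thresholding:       {nj_share_thresholded:.2f}")
```

In this toy setup the thresholded positives are dominated by NJ even though the probabilities imply that most positives should come from NY, which is the same pattern I see in my real data.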
I understand that the model is working correctly: the probabilities for NJ are higher because that is what the data show.
My problem lies in how I want to use my model. I want "individual-level" predictions, while the model is returning probabilities for groups of people.
If this were linear regression, I could look at the prediction interval for each individual and perhaps sample from within it. But as I understand it, prediction intervals are pointless for logistic regression, because for a binary outcome they almost always cover both 0 and 1.
One possibility I've explored is assigning a uniform random number between 0 and 1 to each individual, and then classifying them as positive when that number falls below their predicted probability (in effect, a Bernoulli draw per individual). This works to some extent, but it doesn't seem ideal: the results are a bit too random, in that which individuals end up positive changes substantially from one run to the next.
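A minimal Python sketch of this random-draw approach, again on toy probabilities rather than my real model output (the standard-normal `other_effects` term and all variable names are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Toy probabilities: 95:5 NY:NJ, with baseline rates implied by the
# figures in the question (1% prevalence, positives 68% NY / 32% NJ).
is_nj = rng.random(n) < 0.05
rate_ny = 0.01 * 0.68 / 0.95
rate_nj = 0.01 * 0.32 / 0.05

def logit(p):
    return np.log(p / (1 - p))

# ASSUMPTION: a standard-normal term stands in for all other predictors.
other_effects = rng.normal(0.0, 1.0, n)
log_odds = np.where(is_nj, logit(rate_nj), logit(rate_ny)) + other_effects
prob = 1 / (1 + np.exp(-log_odds))

# One Bernoulli draw per individual: positive when the uniform random
# number falls below that individual's predicted probability.
draw_a = rng.random(n) < prob
draw_b = rng.random(n) < prob   # an independent second run

# Group composition: compare the drawn positives with what the
# probabilities imply.
nj_share_expected = prob[is_nj].sum() / prob.sum()
nj_share_drawn = is_nj[draw_a].mean()

# Individual-level stability: fraction of run A's positives that are
# also positive in run B.
overlap = (draw_a & draw_b).sum() / draw_a.sum()

print(f"NJ share implied by probabilities: {nj_share_expected:.2f}")
print(f"NJ share among drawn positives:    {nj_share_drawn:.2f}")
print(f"Overlap between two runs:          {overlap:.2f}")
```

In this sketch the drawn positives have roughly the state mix that the probabilities imply, but two independent runs share only a small fraction of their positive individuals, which is the randomness I am uneasy about.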
Are there standard ways of dealing with this problem, that is, of generating individual-level predictions from logistic regression probabilities?
Many thanks