Obtaining individual level classifications from predicted probabilities

Question

I need to produce predictions for a binary state at the individual level. The response variable is imbalanced, about 99:1, with the positive class being the minority.

Each row in my dataset represents an individual, and my predictors are predominantly categorical, based on that individual's characteristics. I have just one continuous predictor, the individual's age. I have built a logistic regression model to provide probabilities of a positive response, and the model diagnostics all look good.

I am now at the stage of using the model to make predictions about new individuals, using some test data with known responses. My question concerns the best way to apply a decision to the predicted probabilities, in order to make realistic classifications at an individual level.

If I apply a decision threshold to the predicted probabilities I get roughly the expected proportion of positive cases, but the distribution of characteristics does not look realistic. For example, one categorical predictor is "US_State", which has two levels, "NY" and "NJ", in a ratio of 95:5. Among the actual positives, 68% are from NY and 32% are from NJ. However, among the positives predicted by applying a decision threshold, 17% are from NY and 83% are from NJ.

Because prevalence is much higher in NJ, the predicted probabilities are higher, and my decision threshold results in a very high proportion of NJ individuals with a positive response.
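To make the effect concrete, here is a minimal simulation sketch. It does not use my real data: the group means (0.0072 for NY, 0.064 for NJ), the lognormal spread, and all variable names are assumptions chosen only so that overall prevalence is about 1% and about 68% of actual positives come from NY.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population with the 95:5 NY:NJ split from the question.
n = 200_000
is_nj = rng.random(n) < 0.05

# Assumed group-level mean probabilities (not from the real model):
# overall prevalence is about 1% and about 68% of positives come from NY.
group_mean = np.where(is_nj, 0.064, 0.0072)

# Assumed spread from the other predictors: lognormal noise with mean 1.
sigma = 0.8
p = np.clip(group_mean * rng.lognormal(-sigma**2 / 2, sigma, n), 0.0, 1.0)

# Simulated "true" responses drawn from those probabilities.
y = rng.random(n) < p

# Threshold chosen so that roughly 1% of individuals are flagged.
threshold = np.quantile(p, 0.99)
flagged = p >= threshold

ny_share_true = np.mean(~is_nj[y])
ny_share_flagged = np.mean(~is_nj[flagged])
print(f"NY share of actual positives:  {ny_share_true:.2f}")
print(f"NY share of flagged positives: {ny_share_flagged:.2f}")
```

In this sketch the threshold keeps only the highest probabilities, and those sit almost entirely in the high-prevalence group, even though most actual positives come from the much larger low-prevalence group.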

I understand that the model is working correctly: the probabilities for NJ are higher because that's what the data shows.

My problem lies in how I want to use my model. I want "individual-level" predictions, while the model is returning probabilities for groups of people.

If this were linear regression, I could look at the prediction interval and perhaps sample from within it. But as I understand it, prediction intervals are pointless for logistic regression, as they almost always cover both 0 and 1.

One possibility I've explored is assigning a random number between 0 and 1 to each individual, and then making the classification decision based on whether that random number is greater than their predicted probability or not. This works to some extent, but doesn't seem ideal - the results are a bit too random.
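For reference, the random-number approach described above amounts to one Bernoulli draw per individual. The sketch below is illustrative only: `sample_classifications` is a hypothetical helper name, the probabilities are made up rather than taken from my model, and an individual is labelled positive when the random number falls below their predicted probability.

```python
import numpy as np

def sample_classifications(probs, rng):
    """Draw one Bernoulli outcome per individual: True with probability probs[i]."""
    probs = np.asarray(probs, dtype=float)
    return rng.random(probs.shape) < probs

rng = np.random.default_rng(1)

# Hypothetical predicted probabilities: 95,000 low-risk and 5,000 higher-risk people.
probs = np.concatenate([np.full(95_000, 0.0072), np.full(5_000, 0.064)])
draws = sample_classifications(probs, rng)

# Expected number of positives is the sum of the probabilities:
# 95,000 * 0.0072 + 5,000 * 0.064 = 1004.
expected_positives = probs.sum()
low_share = draws[:95_000].sum() / draws.sum()
print("expected positives:", expected_positives, "sampled positives:", draws.sum())
print("share of sampled positives in the low-risk group:", round(low_share, 2))
```

With probabilities this small, any one person's label is negative almost every time regardless of group, so the individual labels are dominated by the random draw. Only the aggregate counts and the group composition of the sampled positives reflect the predicted probabilities.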

I wondered if there were standard ways of dealing with this problem - generating individual level predictions from logistic regression probabilities?

Many thanks

I think predicting using a binary random variable with probability of success equal to the predicted probability could be a good way, but that's what you did by assigning a random number between 0 and 1 and classifying as positive if the predicted probability is above it. What makes it not ideal for you? — Pohoua, Oct 04 '21 at 10:29
@Pohoua When I tried this it seemed like the resulting classification was based almost entirely on the random number, rather than the predicted probability values. Perhaps this is because my response is so imbalanced - the probabilities are very small. Also, I didn't know if this was a reasonable thing to do - I made it up, and haven't seen others use this method, so partly just wanted to check that. — rw2, Oct 04 '21 at 10:38
Why do you say that you’re generating “group-level” predictions instead of “individual-level”? Aren’t you predicting the probability for Dave from New York and Hannah from New Jersey? — Dave, Oct 04 '21 at 10:45
No, because their name isn't one of the predictors :-) I'm just predicting for "somebody" from New York, and somebody from New Jersey. For example, the model will predict the same probability for all people from New York who are the same age and share the same other characteristics. I need a way to get from those "group-level" probabilities to a prediction for Dave and Hannah, because in reality, not all people of the same age in New York will share the same response. — rw2, Oct 04 '21 at 10:49
As you write, at some point, you will need to refine your model to give you better classifications tailored to the individuals (assuming you have training data on this level). Until then, you are stuck with the probabilistic classifications you have. In both cases, you will need to tailor your decision threshold to the costs of misdecisions either way. [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) may be helpful. — Stephan Kolassa, Oct 04 '21 at 15:57
@rw2 Why should the model predict different probabilities for people who have the exact same characteristics? You're telling the model that these people are the same. — Dave, Oct 04 '21 at 15:59
@Dave The model shouldn't predict anything different - that's not what I'm asking, and I state in the question that I understand the model is working correctly. My question refers to the process of producing classification decisions from those probabilities. — rw2, Oct 04 '21 at 17:47
