
Most examples of logistic regression use a cutoff of 0.5 for classification. But I suspect this is a reasonable starting point only if you have equal numbers of positive and negative samples.

Suppose we want to classify whether someone has (or will get) a disease. We randomly draw 1000 people from the population; only 50 of them (5%) have the disease, while the other 95% don't. Suppose we train a logistic model on various attributes of these 1000 people.

Would it be reasonable to use a cutoff of 0.05 in this case? Any prediction greater than 0.05 gets classified as "disease." I understand the threshold should ultimately be determined by the acceptable rates of false positives and false negatives, but I'm only asking about a starting point here.

Now suppose the ground truth is still that 5% of the population has this disease, but in a second case we sample so that we have 500 people with the disease and 500 without, and we train another model. We would expect the outputs of this second logistic model to be a lot closer to 0.5 than to 0.05, right?

This makes me think that logistic regression uses the class distribution of the training data as its prior probabilities, and that the output is the posterior probability. Is this a reasonable way to understand it?
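To make the comparison concrete, here is a quick simulation sketch of the two sampling schemes (numpy and scikit-learn; the single feature and its effect size are invented for illustration):

```python
# Compare mean predicted probabilities: random sample vs. balanced sample.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A large synthetic "population" with ~5% prevalence and one informative feature.
n = 200_000
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(-3.3 + 0.8 * x))))

def mean_fitted(idx):
    """Fit on the subsample and return its mean predicted probability."""
    m = LogisticRegression(C=1e6)                    # large C ~ unpenalized fit
    m.fit(x[idx].reshape(-1, 1), y[idx])
    return m.predict_proba(x[idx].reshape(-1, 1))[:, 1].mean()

# Case 1: a simple random sample keeps the ~5% prevalence.
srs = rng.choice(n, 1000, replace=False)
print(mean_fitted(srs))                              # roughly 0.05

# Case 2: a balanced sample of 500 from each class.
ones = rng.choice(np.where(y == 1)[0], 500, replace=False)
zeros = rng.choice(np.where(y == 0)[0], 500, replace=False)
print(mean_fitted(np.concatenate([ones, zeros])))    # roughly 0.5
```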

Paul

1 Answer


It doesn't help to use the term 'prior' here unless you are using an explicitly Bayesian method.

Logistic regression estimates a model for a binary dependent variable (there are variations for more than two categories).

As part of this, it estimates the probability, p, that an observation is in one of the two categories; let's say that is category 1 of categories 1 and 2. The probability that it is in the other category is then 1 - p = q.

Then, if the estimated p is greater than 1/2, the better category to predict is category 1; if p is less than 1/2, it is better to predict category 2. (Predicting category 1 when p > 1/2 is wrong with probability 1 - p < 1/2, so this rule minimizes the expected number of misclassifications.)

If your proportion of category 1 is small, then very few of the estimated p's are likely to be greater than 1/2, possibly none. Even so, it is still better to predict category 1 only for those few (or zero) observations with p > 1/2.

You might be tempted to choose category 1 for some of the other observations anyway, probably those with the largest estimated values of p, even if those values are only 1/10 or 1/5. If you do, then for each such observation the probability is high that you have chosen the wrong category. If you were betting on these outcomes, you would simply make a lot of losing bets. Resist this temptation with all your might; it will gain you nothing.
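To put a number on the "losing bets" point, here is a toy sketch (pure numpy; it assumes the estimated probabilities are well calibrated):

```python
# With ~5% prevalence, "betting" on category 1 whenever p > 0.2 loses most bets.
import numpy as np

rng = np.random.default_rng(1)
p = rng.beta(1, 19, size=10_000)   # calibrated probabilities, mean ~0.05
y = rng.binomial(1, p)             # outcomes drawn from those probabilities

bets = p > 0.2                     # pick category 1 at a low cutoff
print(bets.sum())                  # number of bets made (~150)
print(y[bets].mean())              # fraction won: roughly 0.25
print(1 - y[bets].mean())          # fraction lost: roughly 0.75
```

Each such pick is individually more likely wrong than right, which is exactly why lowering the cutoff gains you nothing in raw accuracy.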

So, yes, the estimates from logistic regression do depend on the frequencies of the dependent-variable categories. In fact, the average value of the fitted p's equals the proportion of 1's, as I have described them.
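You can check this property directly: for an unpenalized logistic fit that includes an intercept, the score equations force the fitted probabilities to sum to the number of 1's. A minimal check (statsmodels; the data are made up):

```python
# Verify that the mean fitted probability equals the sample proportion of 1's
# for an unpenalized logistic regression with an intercept (statsmodels).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(1000, 3)))      # intercept + 3 features
beta = np.array([-2.5, 1.0, -0.5, 0.3])              # made-up true coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

fit = sm.Logit(y, X).fit(disp=0)
print(fit.predict(X).mean(), y.mean())               # the two numbers match
```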

So, what should you do? Ignore your results and pick whatever category you like for each observation? If you are just going to ignore your estimates, why bother doing the regression in the first place?

This situation comes up all the time with rare outcomes, such as patient deaths, graft failures, readmissions, and so on.

For many such outcomes, the predicted category just isn't a very useful piece of information, except in certain evaluations of the model's fit.

I might add that the standard errors of the estimated coefficients depend on the proportions of the y categories for a fixed sample size. If observations are difficult to make, requiring a lot of interviewing, lab tests, or digging up old records, you might choose to collect data on equal samples of 1's and 2's.

If 1's are harder to come by, you might choose to collect more 2's than 1's, the usual rule of thumb being up to about five times as many. If you sample this way, you will get correct estimates of the slope coefficients but not of the intercept. You will then want to adjust the intercept, somewhat arbitrarily or with a prior, to get more accurate estimates of the probabilities.
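For concreteness, one standard intercept adjustment is the "prior correction" of King and Zeng (2001, "Logistic Regression in Rare Events Data"): subtract the log of the odds ratio between the sample prevalence and the population prevalence. A minimal sketch, assuming the population prevalence is known and using hypothetical numbers:

```python
# Prior correction for the intercept after case-control (over)sampling,
# following King & Zeng (2001). Assumes the true population prevalence
# pi is known; the fitted intercept below is hypothetical.
import numpy as np

pi = 0.05          # population prevalence of 1's (assumed known)
ybar = 0.50        # proportion of 1's in the artificially balanced sample
b0_sample = 0.10   # intercept estimated from the balanced sample (hypothetical)

# Oversampling 1's inflates the intercept by the log odds ratio between
# the sample and population prevalence; the slopes are unaffected.
b0_corrected = b0_sample - np.log((ybar / (1 - ybar)) * ((1 - pi) / pi))
print(b0_corrected)   # about -2.84
```

With the corrected intercept, the predicted probabilities are back on the population scale, so a cutoff chosen for the population makes sense again.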

David Smith