Most examples of logistic regression use a cutoff of 0.5 for classification. But I suspect this is a reasonable starting point only if the numbers of positive and negative samples are roughly equal.
Suppose we want to classify whether someone has (or will get) a disease. We randomly draw 1000 people from the population. Only 50 (5%) of the 1000 people have the disease, while the other 95% don't. Suppose we train a logistic model on various attributes of these 1000 people.
Would it be reasonable to use a cutoff of 0.05 in this case, so that any prediction greater than 0.05 gets classified as "disease"? I understand the threshold should ultimately be determined by the acceptable rate of false positives and false negatives, but I'm only asking about a starting point here.
Suppose the ground truth is that 5% of the population has this disease. But in a second case, we sample in such a way that we have 500 people with the disease and 500 without, and train another model. We would expect the outputs of the logistic model in this second case to be much closer to 0.5 than to 0.05, right?
This makes me think that logistic regression uses the class distribution of the training data as its prior, and that its output is a posterior probability. Is this the right way to understand it?
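To make the question concrete, here is a minimal simulation sketch of the two sampling scenarios above. It fits logistic regression by plain gradient descent on the log-likelihood (the helper `fit_logistic` and the single synthetic feature are my own illustrative choices, not part of any particular library), and compares the average predicted probability in each case to the class prevalence in that training sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, y, lr=0.1, n_iter=10000):
    # Maximum-likelihood logistic regression with an intercept,
    # fitted by plain gradient ascent on the log-likelihood.
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    return 1.0 / (1.0 + np.exp(-Xb @ w))

n = 1000

# Case 1: random sample, ~5% prevalence
y1 = (rng.random(n) < 0.05).astype(float)
x1 = y1 + rng.normal(0.0, 1.0, n)   # one mildly informative feature
p1 = fit_logistic(x1, y1)

# Case 2: balanced sample, 500 diseased / 500 healthy
y2 = np.repeat([0.0, 1.0], n // 2)
x2 = y2 + rng.normal(0.0, 1.0, n)
p2 = fit_logistic(x2, y2)

print(p1.mean(), y1.mean())   # mean fitted probability vs. sample prevalence
print(p2.mean(), y2.mean())
```

If the intuition in the question is right, the mean fitted probability should track the prevalence of the training sample in each case (near 0.05 in the first, near 0.5 in the second), since the intercept's score equation at the MLE forces the average fitted probability to equal the sample proportion of positives.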