Are thresholds for logistic regression models prevalence-specific?

Question

I wonder if thresholds for logistic regression models are prevalence-specific. I assume that they are, but I am not sure about the underlying statistical principles or how to deal with the implications for clinical practice.

Example:

A hospital wants to deploy a logistic regression model to predict lymph node metastasis in prostate cancer patients. The model is recommended by a specialist society and widely accepted in the medical community.

For model development, a research group used a large dataset where the prevalence of lymph node metastasis was low (15%). They used a lab value (PSA) and age as predictors. After external validation with data from hospitals with similar prevalence (15%), decision curve analysis, and a discussion of the benefits and harms of the treatment, the specialist society found a threshold probability of ≥0.10 appropriate for deciding whether a patient needs specific surgery (a medically reasonable number of true positive and false positive results).

Now the hospital is deploying the model in their surgery consultation-hour (expected prevalence of Patients with lymph node metastasis = 30%).

Questions:

Can they deploy the same threshold probability if they want to have similar true positive and false positive results?
If not, how should the model and/or threshold probability be adjusted (to get similar true positive and false positive results)?

What I already found on this topic:

An interesting blog post about prevalence and probability; however, it does not answer my question regarding thresholds.

The Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement (W17):

In general, models will be more generalizable when the case mix of the new population is within the case mix range of the development population (186). However, as we describe under item 10e (see also Box C and Table 3), one may adjust or update a previously developed prediction model that is applied in another setting to the local circumstances of the new setting to improve the model transportability.

from W17 Table 3:

Updating Method: Adjustment of the intercept (baseline risk/hazard)

Reason for Updating: Difference in the outcome frequency (prevalence or incidence) between development and validation sample

Reproducible Example in R:

#library
library(tidyverse)
library(rmda)

# train data (prevalence= 15%)
train <- tibble(id=1:1000,
                    class=c(rep(1,150),rep(0,850)))

set.seed(1)
train %>% 
  group_by(id) %>% 
  mutate(
  PSA=case_when(class==1 ~ runif(1,1,100),TRUE ~ runif(1,1,40)),
  Age=case_when(class==1 ~ runif(1,30,80),TRUE  ~runif(1,30,60))) -> d.train
  

# test data same prevalence (15%)
test <- tibble(id=1:1000,
                class=c(rep(1,150),rep(0,850)))

set.seed(23)
test %>% 
  group_by(id) %>% 
  mutate(
    PSA=case_when(class==1 ~ runif(1,1,100),TRUE ~ runif(1,1,50)),
    Age=case_when(class==1 ~ runif(1,30,80),TRUE  ~runif(1,25,60))) -> d.test_same_prev



# test data high prevalence (30%)
test1 <- tibble(id=1:1000,
               class=c(rep(1,300),rep(0,700)))

set.seed(123)
test1 %>% 
  group_by(id) %>% 
  mutate(
    PSA=case_when(class==1 ~ runif(1,1,100),TRUE ~ runif(1,1,50)),
    Age=case_when(class==1 ~ runif(1,30,80),TRUE  ~runif(1,25,60))) -> d.test_higher_prev


# train logistic regression model
glm(class ~ Age+PSA, data=d.train,family = binomial) -> model


# make predictions in cohort with same prevalence
predict(model,d.test_same_prev, type="response") -> preds1
plot(preds1)

# make predictions in cohort with high prevalence
predict(model,d.test_higher_prev, type="response") -> preds2
plot(preds2)



# decision curve analysis same prevalence
d.dca.same <- data.frame(reference=d.test_same_prev$class,predictor=preds1)

dca.same <-decision_curve(reference ~predictor,d.dca.same,fitted.risk=TRUE, bootstraps = 10)

plot_decision_curve(dca.same,confidence.intervals=FALSE)



# decision curve analysis high prevalence
d.dca.high <- data.frame(reference=d.test_higher_prev$class,predictor=preds2)

dca.high <-decision_curve(reference ~predictor,d.dca.high,fitted.risk=TRUE, bootstraps = 10)

plot_decision_curve(dca.high,confidence.intervals=FALSE)

Created on 2021-08-08 by the reprex package (v2.0.0)

Optimum decisions are independent of prevalence but are completely dependent on the probability of an outcome for an individual person. See the Diagnosis chapter in [BBR](https://hbiostat.org/doc/bbr.pdf). — Frank Harrell, Aug 08 '21 at 13:20
@FrankHarrell thank you for this comment, especially [this topic](https://discourse.datamethods.org/t/sensitivity-specificity-and-roc-curves-are-not-needed-for-good-medical-decision-making/1152) from BBR is very helpful. I guess ans. 1 would be no. And Q2? How can we transfer a recommended model with recommended thresholds to a new setting (higher probability of an outcome for an individual)? Can models be adjusted (something like [intercept correction](https://stats.stackexchange.com/questions/6067/does-an-unbalanced-sample-matter-when-doing-logistic-regression)) or should we just not use it? — ava, Aug 08 '21 at 17:14
The answer is to do away with "positive" and "negative" and provide well-calibrated predicted outcome probabilities. Then either (1) when a decision is made you'll know the probability it was wrong (1 - predicted probability) or (2) you develop a utility/cost/loss function that translates predicted probabilities into decisions by maximizing expected utility/minimizing expected cost. The training sample needs to be representative of the target sample for a model to be well calibrated, or you need to have predictors that adjust for population differences. — Frank Harrell, Aug 09 '21 at 11:50
@FrankHarrell thank you for sharing your expertise. I agree with your points. In medical practice, physicians often face widely accepted prediction models (the example above reflects a real-world setting) which are not always perfectly designed. I think it is important to perceive limitations of published models/thresholds (Q1). Unfortunately, building a new model is not always feasible. Therefore Q2 focused on model adjustment/update. I actually found some [literature (page W17, table 3)](https://pubmed.ncbi.nlm.nih.gov/25560730/) that tackles model updating for differences in the outcome frequency. — ava, Aug 09 '21 at 22:44

EdM · Accepted Answer · 2021-08-12T18:09:27.300

Three intertwined issues need to be disentangled: (1) calibration of a probability model, (2) whether the model should be used to generate a hard probability threshold, and (3) if so, where the threshold should be. Let's take them in reverse order.

(3) If you have a well-calibrated probability model and there is to be a probability threshold, then the choice should be based on the costs and benefits of true and false assignments to each class. This answer explains the choice for the two-class situation, with links to the complications with multi-class models.

The threshold is not part of the logistic regression, although the title of this question seems to imply otherwise. The threshold is chosen based on the intended application's costs and benefits, after the probability model (however devised, it doesn't have to be logistic regression) is in place.
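To make the cost/benefit point concrete, here is a minimal sketch in R of how a probability threshold falls out of relative costs. The cost values are hypothetical assumptions chosen only so that the result matches the 0.10 threshold in the question; they are not taken from the society's recommendation.

```r
# Hypothetical relative costs (assumptions for illustration only):
# cost_fp = harm of operating on a patient without nodal metastasis,
# cost_fn = harm of withholding surgery from a patient with metastasis.
cost_fp <- 1
cost_fn <- 9

# Treating has the lower expected cost when
#   (1 - p) * cost_fp < p * cost_fn,
# which rearranges to p > cost_fp / (cost_fp + cost_fn).
threshold <- cost_fp / (cost_fp + cost_fn)
threshold

# Decisions for a few hypothetical individual predicted probabilities
p <- c(0.03, 0.08, 0.12, 0.40)
expected_cost_treat   <- (1 - p) * cost_fp
expected_cost_notreat <- p * cost_fn
treat <- expected_cost_treat < expected_cost_notreat
data.frame(p, expected_cost_treat, expected_cost_notreat, treat)
```

Prevalence appears nowhere in this calculation: only the individual's predicted probability and the costs enter, so a change in prevalence matters only through its effect on how well calibrated those probabilities are.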

(2) As Frank Harrell said in a comment, "Optimum decisions are independent of prevalence but are completely dependent on the probability of an outcome for an individual person." The probability of an outcome for an individual might depend on clinical considerations outside of what's captured in your probability model.

Furthermore, the cost/benefit tradeoff discussed above might differ among individuals. An 85-year-old with prostate cancer might have less willingness to undergo surgery to search for potentially positive lymph nodes than a 60-year-old. All of that argues against setting firm probability thresholds for individuals based solely on a model.

(1) The heart of this question is thus whether a probability model based on "a large dataset where the prevalence of lymph node metastasis was low (15%)" can be used in a "surgery consultation-hour (expected prevalence of Patients with lymph node metastasis = 30%)." That's a more complicated question about model calibration, in particular whether the logistic-regression intercept should be adjusted for that prevalence difference.

A logistic regression model for probability $p$ of a condition ($D$) as a function of covariates $X$

$$\log \frac {p}{1-p} = \alpha + \beta^T X $$

has an intercept $\alpha$ representing the log-odds of $D$ in the sampled population at a baseline situation when covariate values are 0 (or at reference levels for categorical predictors). (The answer from @Eoin explores the situation when populations differ in baseline prevalence.) The probability of $D$ given $X$ in that same population is:

$$ p(D|X) = \frac {\exp(\alpha + \beta^T X)}{1+\exp(\alpha + \beta^T X)}.$$

McCullagh and Nelder show (Section 4.3.3) a situation that might need adjustment of the intercept to take the sampled population into account. A retrospective study might evaluate all cases with $D$ but only a subset of those without the condition ($\bar D$). A logistic regression fit to such a sample keeps the same slopes $\beta$, but its intercept is $\alpha^*=\alpha + \log(\pi_0/\pi_1)$, where $\pi_0,\pi_1$ are the fractions of cases $D$ and non-cases $\bar D$ sampled, respectively. To estimate $p(D|X)$ in the entire population with the above formula, you therefore need to subtract $\log(\pi_0/\pi_1)$ from the fitted intercept. But they warn:

It is essential here that the sampling proportions depend only on $D$ and not on $X$.
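The effect of outcome-dependent sampling on the intercept can be checked numerically. The sketch below is not from McCullagh and Nelder; it assumes made-up values for the population intercept, slope, and sampling fractions, and uses Bayes' rule to show that the log-odds among sampled subjects differ from the population log-odds by the constant $\log(\pi_0/\pi_1)$ at every covariate value.

```r
# Assumed values for illustration only
alpha_pop <- -2     # intercept in the full population
beta      <- 0.8    # slope, unchanged by the sampling
pi0 <- 1.0          # fraction of cases (D) sampled
pi1 <- 0.2          # fraction of non-cases sampled

x <- seq(-2, 2, by = 0.5)
p_pop <- plogis(alpha_pop + beta * x)   # p(D | X) in the population

# Bayes' rule: probability of D among the sampled subjects at each x,
# valid only because the sampling fractions do not depend on x
p_sample <- pi0 * p_pop / (pi0 * p_pop + pi1 * (1 - p_pop))

# The log-odds shift is the same constant at every x
shift <- qlogis(p_sample) - qlogis(p_pop)
range(shift)
log(pi0 / pi1)

# Undoing the shift recovers the population probabilities
alpha_sample <- alpha_pop + log(pi0 / pi1)
p_back <- plogis(alpha_sample - log(pi0 / pi1) + beta * x)
all.equal(p_back, p_pop)
```

If the sampling fractions varied with `x`, the shift would no longer be constant and a simple intercept correction would not be enough, which is the point of the warning.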

That's probably not the case in your example of positive-node probability in prostate cancer patients evaluated in a "surgery consultation-hour." Those patients were chosen in part because their covariate values $X$ (probably including PSA and age) indicate that they already are at higher risk of nodal spread than the overall population of prostate cancer patients.

If the original probability model was properly calibrated for the overall population of prostate cancer patients (15% node-positive), the question is whether that overall population is adequately representative of your overall prostate cancer population. In part: is the probability of node-positivity at baseline covariate conditions in the original study similar to yours?

Patients discussed in the "surgery consultation-hour" presumably aren't at baseline covariate conditions. They were pre-selected based on suspected higher risk and thus should have higher expected node-positive probability. If the original model is well calibrated with respect to your overall prostate cancer population, there should be no problem applying it to this pre-selected higher-risk subset.

Thanks, this is very helpful. [As I take it](https://stats.stackexchange.com/questions/176341/logistic-regression-intercept-representing-baseline-probability), I cannot identify the probability at baseline cov. cond. since the original study did not publish an intercept. The orig. model might not be well calibrated, external validation revealed overestimation of risk of positive lymph nodes above a prob. of 0.5. It is stated that this is OK since in their setting 90% of the patients have risk <0.10. What do you exactly mean by "your overall prostate cancer pop."? All patients before selection? — ava, Aug 13 '21 at 09:30
@ava By the "overall prostate cancer population" I mean all prostate cancer patients who might come through your institution--whether or not evaluated at the surgery consult--and would have been included in the sample used to build the model had they been at the other institution(s). If the published model provides a probability of nodal spread as a function of covariates, it might be possible to estimate its intercept from my second formula. If it just set an arbitrary threshold as a function of covariates, I'd be reluctant to use it. — EdM, Aug 13 '21 at 14:21
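As a rough illustration of estimating an unpublished intercept from the formula for $p(D|X)$: if the published tool reports a probability for a known covariate pattern and the coefficients are known, the intercept follows by inverting the logistic function. All numbers below are hypothetical assumptions, not values from any real nomogram.

```r
# Hypothetical published coefficients (assumptions for illustration)
beta  <- c(Age = 0.03, PSA = 0.05)
# A covariate pattern entered into the published tool (hypothetical)
x_ref <- c(Age = 65, PSA = 10)
# Probability the tool reports for that pattern (hypothetical)
p_ref <- 0.12

# Invert p = plogis(alpha + sum(beta * x)) to solve for alpha
alpha_hat <- qlogis(p_ref) - sum(beta * x_ref)
alpha_hat

# Check: the recovered intercept reproduces the reported probability
plogis(alpha_hat + sum(beta * x_ref))
```

Repeating this for several covariate patterns and checking that the same `alpha_hat` comes out would be a reasonable sanity check that the tool really is a plain logistic model.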

Eoin · Answer 2 · 2021-08-12T14:35:02.557

Notwithstanding @Frank Harrell's comments, I think it's useful to think about this in terms of the intercept or bias terms in the logistic regression model, rather than in terms of thresholds. I'm not totally confident in this approach, but it should hopefully be useful!

Let's imagine you only have one predictor, $x$. For convenience, let $x$ be centred to have a mean of $0$ in the training data. The model is then $P(y_i = 1) = \text{logit}^{-1}(\alpha + \beta x_i)$, where the intercept $\alpha$ is the log odds of a positive outcome when $x = 0$ (the average value).

Given a set of predictor values $x_1, x_2, \dots, x_N$, and the parameters $\alpha$ and $\beta$, the predicted prevalence is just the average of the predicted probabilities (I think), $\frac{\sum_{i=1}^N \text{logit}^{-1}(\alpha + \beta x_i)}{N}$.

Now, if prevalence is 15% in your training context, and 30% in your test context, there are a few possible explanations.

The first is that the distribution of the predictors - just $x$ here, but multiple things in reality - differs between the contexts, and this difference explains the difference in the total number of positive cases. If this is the case, your model can be used without modification in the test context.
The second is that the distribution of the predictors hasn't changed, but some additional factors not captured by your model have. This could be handled heuristically by adjusting the value of $\alpha$ until the average predicted probability of a positive outcome matches the prevalence you expect for the test data (30%).
The third, and most likely, explanation is a mixture of the above: some things captured by your model have changed, and some things not captured have changed as well. I think, but I can't say for certain, that this could be handled in the same way, by adjusting the value of $\alpha$ until the mean predicted probability matches the expected prevalence.

Now, none of this will help if the relationship between your predictors and the outcome differs between the training and test contexts, but there's not much we can do about that.
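The heuristic intercept adjustment described above can be sketched in a few lines of R. This is only an illustration under assumed values: the coefficients, the predictor distribution in the new setting, and the helper `gap()` are all hypothetical, and `uniroot()` is just one convenient way to solve for the shift.

```r
set.seed(1)

# Hypothetical fitted model and new-setting predictor values (assumptions)
alpha <- -2.5
beta  <- 0.9
x_new <- rnorm(500, mean = 0.3)

target_prev <- 0.30   # prevalence expected in the new setting

# Mean predicted probability after shifting the intercept by delta,
# minus the target prevalence; the root is the shift we need
gap <- function(delta) {
  mean(plogis(alpha + delta + beta * x_new)) - target_prev
}

delta <- uniroot(gap, interval = c(-10, 10), tol = 1e-10)$root
alpha_new <- alpha + delta

mean(plogis(alpha + beta * x_new))       # mean prediction before adjustment
mean(plogis(alpha_new + beta * x_new))   # matches target_prev after adjustment
```

Only the intercept moves; the slope, and therefore the ranking of patients, is untouched. Whether the shifted probabilities are actually well calibrated still has to be checked against outcomes observed in the new setting.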

Update

For my own L&D, I had a go at simulating this, and it seems to work as described. Code is here: https://gist.github.com/EoinTravers/656ac7b77a5cfa966c706888185afcd5

score 2 · Answer 3 · answered Aug 12 '21 at 18:57

In a sense, yes. This is the default with logistic regression. It is not treated as a "problem" to be remedied, but perhaps it should be. "Prevalence" here is taken to mean in-sample prevalence: specifically, if you calculate the fitted probabilities for each patient in a logistic regression sample and average them, you will obtain the in-sample prevalence. An example in R:

set.seed(123)
x <- seq(-3, 3, 0.01)
y <- rbinom(length(x), 1, plogis(-3 + 0.4 * x))
f <- glm(y ~ x, family=binomial)
sum(f$fitted)
sum(y)

gives

> sum(f$fitted)
[1] 42
> sum(y)
[1] 42

i.e. 7% prevalence.

Nevertheless, it's possible to build more complicated hierarchical models, or to use weighting, to handle issues such as the oversampling of cases in a case-control study, or nested samples. Scott and Wild (1999) discuss weighting the cases in a case-control sample by the "known prevalence" of the outcome, and conversely for the controls. This corrects the intercept term in the model so that the calibration is fitted to the referent prevalence. Of course, even the "known prevalence" may have uncertainty and there is not yet any optimum way to account for these layers of error.

One relevant example is the COVID test. I'm not sure if a logistic model is involved in predicting presence of disease, but it was somewhat shocking to see how a test that was developed to diagnose presence of disease among symptomatic people at a particular time of the epidemic was basically unchanged for performing the same test in asymptomatic people later on. Only recently has the number of PCR cycles been adjusted, which consequently reduced the number of false positive cases.
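As a rough illustration of the weighting idea, the sketch below simulates a population, draws a case-control sample, and refits with inverse-sampling-fraction weights. It is a generic weighted `glm()` fit under assumed parameter values, not necessarily the exact estimator discussed by Scott and Wild, and it assumes the control sampling fraction is known.

```r
set.seed(42)

# Simulated population with assumed true parameters
N <- 200000
alpha_true <- -3
beta_true  <- 0.4
x <- rnorm(N)
y <- rbinom(N, 1, plogis(alpha_true + beta_true * x))

# Case-control sample: all cases plus an equal number of random controls
cases    <- which(y == 1)
controls <- sample(which(y == 0), length(cases))
idx <- c(cases, controls)
cc  <- data.frame(y = y[idx], x = x[idx])

# Weight each subject by the inverse of its sampling fraction
frac_controls <- length(controls) / sum(y == 0)
cc$w <- ifelse(cc$y == 1, 1, 1 / frac_controls)

fit_unweighted <- glm(y ~ x, family = binomial, data = cc)
# quasibinomial avoids warnings about non-integer weights
fit_weighted   <- glm(y ~ x, family = quasibinomial, data = cc, weights = w)

coef(fit_unweighted)   # intercept shifted by the oversampling of cases
coef(fit_weighted)     # intercept close to the population value
```

The point estimates from the weighted fit target the population model, but the standard errors that `glm()` reports for it should not be trusted; a design-based or sandwich variance estimator would be needed for inference.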
