I want to externally validate a logistic regression model that my former colleague constructed together with another company, and that has been published. I have data from another cohort to which I want to apply the model. I'm using R, but I think my question is more about the concept of logistic regression in general.
I have found a lot of topics on how to validate a newly constructed model in R (glm class), but because I don't have the original dataset, I have to "reconstruct" the logistic regression to determine an individual's probability of the outcome. The article only presents prevalences and odds ratios.
I thought of two strategies:
- Using the proportions and odds ratios from the article to work out the model's intercept (it's not given in the article), using
$$
intercept = \log(baseline\_odds) - \left( \log(OR_1) \cdot X_1 + \dots + \log(OR_n) \cdot X_n \right)
$$
with $X_i$ the prevalence in the study population (between $0$ and $1$) of the characteristic associated with $OR_i$.
I want to use this intercept to calculate the probability of the outcome in all individuals of my new population:
$$ odds = \exp(intercept + \log(OR_1) \cdot Y_1 + \dots + \log(OR_n) \cdot Y_n) \\ p = odds / (1 + odds) $$
so with $Y_i$ the prevalences in the validation population.
- Multiplying the odds in the new population with the odds ratios from the article, if that characteristic applies, so
$$ odds = baseline\_odds \cdot OR_i \\ p = odds / (1 + odds) $$
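To make the two strategies concrete, here is a minimal sketch of the arithmetic in Python (the same few lines carry over to R directly). All numbers are hypothetical and not taken from the article, the function names are made up for illustration, and $Y_i$ is treated as a 0/1 indicator for one individual, which is an assumption about how the formulas are meant to be applied.

```python
import math

def intercept_from_summary(baseline_odds, odds_ratios, prevalences):
    """Strategy 1, step 1: back out an intercept from published summary figures."""
    return math.log(baseline_odds) - sum(
        math.log(or_i) * x_i for or_i, x_i in zip(odds_ratios, prevalences)
    )

def prob_strategy_1(intercept, odds_ratios, y):
    """Strategy 1, step 2: probability from the intercept plus log odds ratios."""
    log_odds = intercept + sum(
        math.log(or_i) * y_i for or_i, y_i in zip(odds_ratios, y)
    )
    odds = math.exp(log_odds)
    return odds / (1 + odds)

def prob_strategy_2(baseline_odds, odds_ratios, y):
    """Strategy 2: multiply the baseline odds by each odds ratio that applies."""
    odds = baseline_odds
    for or_i, y_i in zip(odds_ratios, y):
        if y_i:
            odds *= or_i
    return odds / (1 + odds)

# Hypothetical inputs (assumptions, not values from the article).
baseline_odds = 0.25        # overall outcome odds, i.e. 20% outcome prevalence
odds_ratios = [2.0, 1.5]    # published odds ratios
prevalences = [0.30, 0.50]  # prevalence of each characteristic in the study population
individual = [1, 0]         # one person: has characteristic 1, not characteristic 2

intercept = intercept_from_summary(baseline_odds, odds_ratios, prevalences)
p1 = prob_strategy_1(intercept, odds_ratios, individual)
p2 = prob_strategy_2(baseline_odds, odds_ratios, individual)
print(f"intercept = {intercept:.4f}, strategy 1 p = {p1:.4f}, strategy 2 p = {p2:.4f}")
```

With these inputs the two strategies give different probabilities for the same person, because strategy 2 treats the overall odds as the odds of someone with none of the characteristics, while strategy 1 shifts the overall odds down by the prevalence-weighted log odds ratios first.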
Is my way of thinking correct?
I have multiple problems however:
- If I calculate the intercept, it's not exactly the same as the original intercept (I checked with a model with all parameters known). Is this all because of rounding errors?
I am aware that the intercept also contains some information from the specific population (like prevalence in the training population), and corrects for over- or underfitting.
- The probabilities I compute are different for both strategies. This makes sense because option #2 ignores all the information that was stored in the intercept, but that option allows me to adjust for a different prevalence in the new population.
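The intercept mismatch in the first problem can be reproduced without any rounding. The sketch below is a constructed example, not the published model: it assumes two independent binary predictors, a known "true" intercept, and that baseline_odds means the overall outcome odds in the study population. It computes the exact outcome prevalence implied by the model and then applies the intercept formula from strategy 1.

```python
import math
from itertools import product

def expit(z):
    """Inverse logit: converts log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical "true" model (assumed for illustration only).
true_intercept = -2.0
odds_ratios = [2.0, 1.5]
prevalences = [0.30, 0.50]  # predictors assumed independent and binary

# Exact outcome prevalence: average the model probability over every
# covariate pattern, weighted by how common that pattern is.
outcome_prevalence = 0.0
for pattern in product([0, 1], repeat=len(odds_ratios)):
    weight = 1.0
    linear_predictor = true_intercept
    for x, prev, or_i in zip(pattern, prevalences, odds_ratios):
        weight *= prev if x else (1.0 - prev)
        linear_predictor += math.log(or_i) * x
    outcome_prevalence += weight * expit(linear_predictor)

# Apply the reconstruction formula from strategy 1.
baseline_odds = outcome_prevalence / (1.0 - outcome_prevalence)
reconstructed_intercept = math.log(baseline_odds) - sum(
    math.log(or_i) * prev for or_i, prev in zip(odds_ratios, prevalences)
)
gap = reconstructed_intercept - true_intercept
print(f"true = {true_intercept:.4f}, reconstructed = {reconstructed_intercept:.4f}, gap = {gap:.4f}")
```

Under these assumptions the reconstructed intercept does not equal the true one even though nothing was rounded. The log odds of the average probability is not the same as the average of the individual log odds, so the formula is an approximation, and rounding in the published odds ratios would add to this gap rather than explain all of it.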
My question is which strategy I should use, and how I can tackle the problems I encounter.
Thank you in advance!
(Of course I searched on StackExchange and Google, and I found a lot of articles about logistic regression, but unfortunately I couldn't get the answer from there.
I saw "Help me understand adjusted odds ratio in logistic regression", "Odds and odds ratios in logistic regression", and "Estimating predicted probabilities from logistic regression: different methods correspond to different target populations" (but that one was too difficult for me), and many more.)