
In my line of work I have seen people quantify general model uncertainty for a logistic regression model with an Agresti–Coull confidence interval. I am not convinced that this is correct, and I have not been given a good explanation, nor a reference showing that it is.

Example:

Let's say you have a dataset with 60,000 customers, of which 58,000 are so-called "good" customers and 2,000 are "bad" customers (the definitions of "good" and "bad" are not important for the point here). The dependent variable takes the value 1 when the customer is "bad" and 0 when the customer is "good". It is then possible to use a logistic regression model to model the probability that a customer is "bad".

To quantify the general model uncertainty, the Agresti–Coull confidence interval is then used (in practice, any binomial proportion confidence interval could be used). In the example below, a one-sided upper 95% confidence interval is used. Following the Wikipedia article on the Agresti–Coull interval, I calculate the interval as follows:

  1. The proportion of "bad" customers is calculated as $$\hat{p}=\frac{2{,}000}{60{,}000}=0.0333\space(3.33\%)$$
  2. $$z=z_{\alpha}=z_{0.05}=1.65$$
  3. Given $X$ successes in $n$ trials, define $$\tilde{n}=n+z^{2}=60{,}000+1.65^{2}=60{,}002.72$$ and $$\tilde{p}=\frac{1}{\tilde{n}}\left(X+\frac{z^{2}}{2}\right)=0.0334\space(3.34\%)$$
  4. The (one-sided) upper limit is then calculated as $$UL_{AC,95}=\tilde{p}+z\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}=0.0346\space(3.46\%)$$

Given the above, the general model uncertainty is calculated as the ratio of the upper limit of the Agresti–Coull interval to the observed proportion of "bad" customers (using the unrounded values): $$MODUNC=\frac{0.03456}{0.03333}=1.0369$$ From this, one concludes that the general model uncertainty is around 3.69%. A short numerical check of these figures follows below.
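For reference, a minimal Python sketch that reproduces the numbers above (the variable names are my own):

```python
from math import sqrt

# Reproduce the Agresti-Coull calculation from the question.
X, n = 2000, 60000        # "bad" customers, total customers
z = 1.65                  # one-sided 95% critical value (z_0.05 ~= 1.645)

p_hat = X / n                                   # 0.0333...
n_tilde = n + z**2                              # 60002.7225
p_tilde = (X + z**2 / 2) / n_tilde              # 0.03335...
UL = p_tilde + z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)  # 0.03456...

print(f"p_hat  = {p_hat:.4f}")
print(f"UL     = {UL:.4f}")
print(f"MODUNC = {UL / p_hat:.4f}")             # ~1.0369
```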

Discussion:

One argument goes that, because the logistic regression model is calibrated to the level of "bad" customers in the dataset (the $3.33\%$ above), a binomial proportion confidence interval can be used to quantify some sort of model uncertainty. However, this approach does not take into account the number of explanatory variables, the estimation method, etc. Since the method uses only information about the sample, and none of the actual modelling choices, I cannot see the rationale behind it, except that the average of the estimated probabilities from the logistic regression model will equal the "bad" proportion in the dataset.

Further, I believe that general model uncertainty should instead be reflected in the confidence intervals around the estimated probabilities from the logistic regression model. This is also what textbooks and other questions/answers on Cross Validated suggest.
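As an illustration of that alternative, here is a hedged Python sketch on simulated data (the data-generating step is entirely made up; only the construction of the interval matters): fit the logistic regression, build a 95% interval for the linear predictor on the logit scale, where the estimate is approximately normal, and transform it to the probability scale.

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

rng = np.random.default_rng(0)

# Simulated stand-in for the customer data (illustrative only):
# intercept chosen so the "bad" proportion is roughly 3-4%.
n = 60_000
x = rng.normal(size=n)
y = rng.binomial(1, expit(-3.4 + 0.5 * x))

X = sm.add_constant(x)
res = sm.Logit(y, X).fit(disp=0)

# Pointwise 95% CI for P(bad | x): construct it on the logit scale,
# then transform back to the probability scale.
eta = X @ res.params                     # fitted linear predictor
se = np.sqrt(np.einsum("ij,jk,ik->i", X, res.cov_params(), X))
lower = expit(eta - 1.96 * se)
upper = expit(eta + 1.96 * se)

print(expit(eta[:3]))                    # fitted probabilities
print(lower[:3], upper[:3])              # their confidence limits
```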

Questions:

  1. Is the above method valid for quantifying general model uncertainty? I believe it is wrong.
  2. If it is a valid method, could someone give me some references or an explanation of why it is valid?
  3. How would you quantify model uncertainty in a logistic regression model? Using the estimated probabilities and their confidence limits, or some other way? Could you give some references?
  • The answers to https://stats.stackexchange.com/questions/96733/generate-predictions-from-a-logistic-regression-model-reflecting-the-uncertainty suggest that the above method is wrong. – Plissken Jan 10 '22 at 11:41
  • You are right that it's wrong, for the reasons you outlined. See e.g. https://stats.stackexchange.com/questions/539702/logistic-regression-how-to-compute-a-prediction-interval (there is some confusion over whether it is named the confidence or the prediction interval on the probability). – seanv507 Jan 10 '22 at 11:48
  • What you are calling model uncertainty is not that but is rather the margin of error. One example of margin of error is 1/2 the width of a 0.95 compatibility (aka confidence) interval. Best to do this on the risk scale or the odds scale rather than a risk ratio scale. Note that you are not dealing with _rates_ but rather are dealing with probabilities and proportions. _Model uncertainty_ may be quantified using stability analysis with the bootstrap as discussed throughout [here](https://hbiostat.org/rms). – Frank Harrell Jan 10 '22 at 13:16
  • @seanv507 Thank you for the response. Yes, I can see there is some confusion over whether it is called a confidence interval or a prediction interval :) – Plissken Jan 11 '22 at 08:08
  • @FrankHarrell Thank you for the response. I actually have the 1st ed. of your book, which is great. Are you agreeing that the above-mentioned method is used incorrectly? Is your suggestion for model uncertainty to use the bootstrap as is done here: https://rpubs.com/vadimus/bootstrap?... If I want to quantify the dispersion of the distribution of the statistical estimator, should I then only use the confidence intervals around the estimated parameters, or also show prediction intervals (or are they also called confidence intervals?) around the estimated probabilities? – Plissken Jan 11 '22 at 08:10
  • @FrankHarrell Thank you for the correction. I have changed "rate" to "proportion" in the post. – Plissken Jan 11 '22 at 08:43
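Following the bootstrap suggestion in the comments, a minimal sketch of what a resampling-based assessment of the fitted probabilities might look like (the toy data, the number of resamples `B`, and the helper name are placeholders of mine, not a definitive implementation):

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

def bootstrap_predictions(y, X, X_new, B=200, seed=0):
    """Refit the logistic model on B bootstrap resamples; return a
    (B, len(X_new)) matrix of predicted probabilities."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((B, X_new.shape[0]))
    for b in range(B):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        preds[b] = sm.Logit(y[idx], X[idx]).fit(disp=0).predict(X_new)
    return preds

# Toy data, for illustration only.
rng = np.random.default_rng(1)
x = rng.normal(size=5_000)
X = sm.add_constant(x)
y = rng.binomial(1, expit(-3.4 + 0.5 * x))

preds = bootstrap_predictions(y, X, X[:3])
print(np.percentile(preds, [2.5, 97.5], axis=0))   # percentile intervals
```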
