In my line of work I have seen people quantify general model uncertainty, when using a logistic regression model, with an Agresti-Coull confidence interval. I am not convinced that this is correct, and I have not found a good explanation of, or a reference for, why it would be.
Example:
Let's say you have a dataset with 60,000 customers, of which 58,000 are so-called "good" customers and 2,000 are "bad" customers (the definition of "good" and "bad" is not important for the point here). The dependent variable takes the value 1 when the customer is "bad" and 0 when the customer is "good". It is then possible to use a logistic regression model to model the probability that a customer is "bad".
In order to quantify the general model uncertainty, the Agresti-Coull confidence interval is then used. In practice, any binomial proportion confidence interval could be used. In the example below, a one-sided upper 95% confidence interval is used. Following the Wikipedia article on the Agresti–Coull interval, I calculate the interval as follows:
- The proportion of "bad" customers is calculated as $$\hat{p}=\frac{2{,}000}{60{,}000}=0.0333\space (3.33\%)$$
- The one-sided 95% critical value is $$z=z_{0.05}\approx 1.65$$
- Given $X=2{,}000$ successes in $n=60{,}000$ trials, define $$\tilde{n}=n+z^{2}=60{,}000+1.65^{2}=60{,}002.72$$ and $$\tilde{p}=\frac{1}{\tilde{n}}\left(X+\frac{z^{2}}{2}\right)=0.0334\space(3.34\%)$$
- The (one-sided) upper limit is then calculated as $$UL_{AC,95}=\tilde{p}+z\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}=0.0346\space(3.46\%)$$

Given the above, we can then calculate the general model uncertainty as the ratio of the upper limit of the Agresti-Coull interval to the observed proportion of "bad" customers: $$MODUNC=\frac{UL_{AC,95}}{\hat{p}}=\frac{0.0346}{0.0333}=1.0369$$ (computed from the unrounded values). From this, we conclude that the general model uncertainty is around 3.69%.
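To make the arithmetic above easy to check, here is a minimal Python sketch of the same calculation. The function name `agresti_coull_upper` is my own and purely illustrative; the inputs are the numbers from the example, including the rounded critical value $z=1.65$.

```python
import math

def agresti_coull_upper(x, n, z):
    """One-sided upper Agresti-Coull limit for x successes in n trials.

    Hypothetical helper for illustration; z is the one-sided critical value.
    """
    n_tilde = n + z ** 2                      # adjusted number of trials
    p_tilde = (x + z ** 2 / 2) / n_tilde      # adjusted proportion
    half_width = z * math.sqrt(p_tilde * (1 - p_tilde) / n_tilde)
    return p_tilde + half_width

x, n, z = 2000, 60000, 1.65                   # values from the example above
p_hat = x / n                                 # observed "bad" proportion
upper = agresti_coull_upper(x, n, z)
modunc = upper / p_hat                        # ratio used as "model uncertainty"

print(f"p_hat={p_hat:.4f} upper={upper:.4f} modunc={modunc:.4f}")
# prints: p_hat=0.0333 upper=0.0346 modunc=1.0369
```

Note that the only inputs are `x`, `n` and `z`: nothing about the fitted logistic regression enters the calculation.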
Discussion:
One of the arguments is that, because the logistic regression model is calibrated to the level of "bad" customers in the dataset (the $3.33\%$ above), we can use a binomial proportion confidence interval to quantify some sort of model uncertainty. However, this approach does not take into account the number of explanatory variables, the estimation method, etc. Since it uses only information about the sample and none of the actual modelling choices, I cannot see the rationale behind it, except that the average of the estimated probabilities from the logistic regression model will equal the "bad" proportion in the dataset.
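The calibration property mentioned in this argument can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are simulated (a single standard-normal covariate and a made-up true model), the sample is smaller than in the example to keep it fast, and `fit_logistic` is a hand-rolled Newton-Raphson maximum-likelihood fit rather than any particular library's routine.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson ML fit of P(y=1) = sigmoid(b0 + b1*x) (illustrative)."""
    y_bar = sum(ys) / len(ys)
    b0, b1 = math.log(y_bar / (1.0 - y_bar)), 0.0   # start at intercept-only fit
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p              # score for the intercept
            g1 += (y - p) * x        # score for the slope
            h00 += w                 # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # beta += (X'WX)^{-1} * score
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated, hypothetical data: NOT the 60,000-customer dataset from the example.
random.seed(0)
n = 6000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if random.random() < sigmoid(-3.8 + 1.0 * x) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ys)
fitted = [sigmoid(b0 + b1 * x) for x in xs]

bad_rate = sum(ys) / n               # observed "bad" proportion
mean_fitted = sum(fitted) / n        # average estimated probability

print(f"observed bad rate     : {bad_rate:.6f}")
print(f"mean fitted probability: {mean_fitted:.6f}")
print(f"fitted range          : {min(fitted):.4f} to {max(fitted):.4f}")
```

In this sketch the mean fitted probability matches the observed "bad" rate (the intercept's score equation forces this at the maximum-likelihood solution), while the individual fitted probabilities spread over a wide range. A binomial interval around the overall rate says nothing about that spread.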
Further, I believe that general model uncertainty should rather reflect the confidence intervals around the estimated probabilities from the logistic regression model. This is also what textbooks and other questions/answers on CrossValidated say.
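To be concrete about what I mean, the interval I have in mind is the usual Wald-type interval, computed on the logit scale and mapped back to the probability scale. The notation is my own: $x_0$ is the covariate vector of a given customer (including the intercept term), $\hat{\beta}$ is the vector of estimated coefficients, and $\widehat{\operatorname{Cov}}(\hat{\beta})$ is its estimated covariance matrix.

$$\hat{\eta}_0=x_0^{\top}\hat{\beta},\qquad \widehat{se}(\hat{\eta}_0)=\sqrt{x_0^{\top}\,\widehat{\operatorname{Cov}}(\hat{\beta})\,x_0}$$

$$\left[\frac{1}{1+e^{-(\hat{\eta}_0-z\,\widehat{se}(\hat{\eta}_0))}},\;\frac{1}{1+e^{-(\hat{\eta}_0+z\,\widehat{se}(\hat{\eta}_0))}}\right]$$

Unlike the binomial proportion interval, the width of this interval depends on which explanatory variables are in the model, on $x_0$, and on how precisely the coefficients are estimated.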
Questions:
- Is the above method valid to quantify general model uncertainty? I believe it is wrong.
- If it is a valid method, could someone give me some references, or an explanation of why it is valid?
- How would you quantify model uncertainty in a logistic regression model? Using the estimated probabilities and their confidence limits, or some other way? Could you give some references?