Prediction intervals for the outcome of a logistic regression with binomial response

Question

Suppose we have a logistic regression model:

$$\begin{align} P(y=1\vert\mathbf{x}) &= p \\ \log\left(\frac{p}{1-p}\right) &= \boldsymbol{\beta}\mathbf{x} \end{align}$$

Given a random sample $D=\{\mathbf{X},\mathbf{y}\}$ of size $N$, we can compute confidence intervals for the $\boldsymbol{\beta}$ and correspondingly prediction intervals for $p$, given a certain value $\mathbf{x}^*$ of the predictor vector. This is all very standard and detailed, for example, here.

Suppose instead that I'm interested in a prediction interval for $y$, given $\mathbf{x}^*$. Of course, it doesn't make any sense at all to compute a prediction interval for a single realization of $y$, because $y$ can only take the values 0 and 1, and no value in between. However, if we consider $m$ realizations of $y$ for the same fixed value of $\mathbf{x}^*$ , then this becomes similar (but not identical) to the question of computing a prediction interval for a binomial random variable. This is basically the same situation described by Glen_b in the comments to this answer. Does this question have an answer, apart from the trivial one "use nonparametric bootstrap"?

can you compute a prediction interval for $log(p / (1-p))$ instead perhaps? — Hugh Perkins, Mar 31 '18 at 01:15
@HughPerkins I think that the issue is how to combine the uncertainty in _p_ with the uncertainty in binomial sampling also given the uncertainty in _p_. Is there a closed-form solution? — EdM, Mar 31 '18 at 02:17
@EdM you got my point. I wonder if there is a closed form solution or an analytical approximation. — DeltaIV, Mar 31 '18 at 12:03
[offtopic] random idea, it occurs to me that it could be interesting to have a tag like 'open-research-opportunity' for questions like this which/if they are answered in the negative — Hugh Perkins, Mar 31 '18 at 16:05

score 4 · Accepted Answer · answered Apr 01 '18 at 05:45

4

One way this should work without bootstrapping (which in practice may be the fastest thing tho implement), would be:

Assume that a normal approximation for the predicted log-odds ($x \hat{\beta}$) plus/minus its standard error works. Any logistic regression software will provide this.
The percentiles of this distribution transform to probabilities via the anti-logit.
One can find a (mixture of) beta distribution(s) that approximates the predictive distribution for the probability well.
The predictive distribution for the outcome is then a (mixture of) beta-binomial distribution(s with the same mixing weights as used in step 3).

Alternatively, one can "just" integrate out the log-odds from the joint predictive of outcome and log-odds, but I believe that will be a complete mess with no closed form solution.

answered Apr 01 '18 at 05:45

Björn

21,227
2
26
65

4

You could also just directly simulate from the asymptotic multivariate normal for $\beta-\hat{\beta}$, and then form a mixture of binomials over those values. – Glen_b Apr 01 '18 at 06:28
I like the overall idea, but I'm not sure about the details. For example, "find a (mixture of) beta distribution(s) that approximates the predictive distribution for the probability well", How do you this in practice? Could you add an example? Even a low-dimensional one would suffice. – DeltaIV Apr 02 '18 at 06:07
About your second alternative, if I understand correctly you suggest to write down the joint posterior distribution of outcomes and log-odds (which is a joint distribution between continuous and discrete variables) and integrate out log-odds. I definitely agree that that's going to be a mess. One could salvage the approach using MCMC to estimate the integral, I think. However, I would need MCMC algorithms which work for discrete-continuous variables...I'm not sure which one do. I think `rstan` (my tool of choice for MCMC) doesn't. – DeltaIV Apr 02 '18 at 06:11
The proposal by Glen_b is of course very easy: simulate from MVN for the betas, form log - odds, convert too probability, simulate binary outcomes. – Björn Apr 02 '18 at 06:14
Bjorn, the same question holds for you. How many components should I use in the mixture of binomials? How do I estimate $\alpha_i$ and $\beta_i$ for each component of the mixture? I'm not very familiar with density estimation, unfortunately. Even a reference to a useful package would be enough: I don't expect you two to lay down the theory of beta mixture estimation for me (that would be another question). – DeltaIV Apr 02 '18 at 06:14
@DeltaIV My suggestion has the same number of components as values you simulated from the MVN; let's say you have a single prediction to create and want to generate $N$ simulations, then you can set up $n+1$ bins to hold the distribution at any value of $\mathbf{x}$ you're predicting for. At each step you simulate your parameter difference vector $\beta-\hat{\beta}$ and then you have a conditional binomial for the prediction. You multiply that conditional pmf by 1/N and add that into the $n+1$ probabilities you're collecting. (These days this simulation would be called a parametric bootstrap.) – Glen_b Apr 02 '18 at 07:34
However, if you only had a single predicted value, you can of course shortcut this to essentially a numerical integration\*. For that, compute instead the distribution of $[\beta-\hat{\beta}]x$ (keeping in mind this is 'backwards', then shift it by the constant $\hat{\beta}x$), and across a fine grid of values over that distribution do a probability-weighted version of the suggestion in the previous comment. – Glen_b Apr 02 '18 at 07:48
@Glen_b sorry - my comments didn’t make sense. I kept mixing betas, binomials & beta-binomials in my mind. I’m on holiday now and apparently my mind is on holiday too :-) Once I’m back I’ll try to rewrite your comments as an answer and implement it into an R code - not b/c I plan to accept it, but just because this way I’ll be forced to follow the train of thoughts. Thanks to both of you – DeltaIV Apr 03 '18 at 10:15
3

I can write this up as something in the form of an answer if you prefer -- I don't mind either way. – Glen_b Apr 03 '18 at 14:58
2

@Glen_b I'd really appreciate that. – DeltaIV Apr 06 '18 at 18:25
1

@Glen_b, I would be interested in seeing that answer. – Richard Hardy Jun 24 '19 at 08:56
@DeltaIV, I would be interested in seeing that answer. – Richard Hardy Jun 24 '19 at 08:56

Prediction intervals for the outcome of a logistic regression with binomial response

1 Answers1

Linked