21

I would like to understand how to generate prediction intervals for logistic regression estimates.

I was advised to follow the procedures in Collett's Modelling Binary Data, 2nd Ed p.98-99. After implementing this procedure and comparing it to R's predict.glm, I actually think this book is showing the procedure for computing confidence intervals, not prediction intervals.

Implementation of the procedure from Collett, with a comparison to predict.glm, is shown below.

I would like to know: how do I go from here to producing a prediction interval instead of a confidence interval?

#Derived from Collett 'Modelling Binary Data' 2nd Edition p.98-99
#Need reproducible "random" numbers.
seed <- 67

num.students <- 1000
which.student <- 1

#Generate data frame with made-up data from students:
set.seed(seed) #reset seed
v1 <- rbinom(num.students,1,0.7)
v2 <- rnorm(length(v1),0.7,0.3)
v3 <- rpois(length(v1),1)

#Create df representing students
students <- data.frame(
    intercept = rep(1,length(v1)),
    outcome = v1,
    score1 = v2,
    score2 = v3
)
print(head(students))

predict.and.append <- function(input){
    #Create a vanilla logistic model as a function of score1 and score2
    data.model <- glm(outcome ~ score1 + score2, data=input, family=binomial)

    #Calculate predictions and SE.fit with the R package's internal method
    # These are in logits.
    predictions <- as.data.frame(predict(data.model, se.fit=TRUE, type='link'))

    predictions$actual <- input$outcome
    predictions$lower <- plogis(predictions$fit - 1.96 * predictions$se.fit)
    predictions$prediction <- plogis(predictions$fit)
    predictions$upper <- plogis(predictions$fit + 1.96 * predictions$se.fit)


    return (list(data.model, predictions))
}

output <- predict.and.append(students)

data.model <- output[[1]]

#summary(data.model)

#Export vcov matrix 
model.vcov <- vcov(data.model)

# Now our goal is to reproduce 'predictions' and the se.fit manually using the vcov matrix
this.student.predictors <- as.matrix(students[which.student,c(1,3,4)])

#Prediction:
this.student.prediction <- sum(this.student.predictors * coef(data.model))
square.student <- t(this.student.predictors) %*% this.student.predictors
se.student <- sqrt(sum(model.vcov * square.student))

manual.prediction <- data.frame(lower = plogis(this.student.prediction - 1.96*se.student), 
    prediction = plogis(this.student.prediction), 
    upper = plogis(this.student.prediction + 1.96*se.student))

print("Data preview:")
print(head(students))
print(paste("Point estimate of the outcome probability for student", which.student,"(2.5%, point prediction, 97.5%) by Collett's procedure:"))
manual.prediction
print(paste("Point estimate of the outcome probability for student", which.student,"(2.5%, point prediction, 97.5%) by R's predict.glm:"))    
print(output[[2]][which.student,c('lower','prediction','upper')])
carbocation
  • 359
  • 1
  • 3
  • 9
  • A basic question, why is sqrt(sum(model.vcov * square.student)) assumed as the standard error? Isn't it the standard deviation and needs to be divided by sqrt(n)? If so, which n should be used, n used to fit the model or n of the new data frame used to predict? – Rafael Apr 04 '17 at 03:00

1 Answers1

7

Prediction intervals predict where the actual response data values are predicted to fall with a given probability. Since the possible values of the response of a logistic model are restricted to 0 and 1, the 100% prediction interval is therefore $ 0 <= y <= 1 $. No other intervals really make sense for prediction with logistic regression. Since it is always the same interval it generally is not interesting enough to generate or discuss.

Greg Snow
  • 46,563
  • 2
  • 90
  • 159
  • 6
    I'm looking for a 95% prediction interval of a prediction which is in log-odds space. Later I transform that to probability space. A 100% prediction interval would never be interesting for any procedure, right? For example, a 100% prediction interval for linear regression would include -Inf to Inf... At any rate, as you can see in my code, the prediction interval is calculated in log odds space, which is then transformed into probability space later. So I don't think my question is pointless. – carbocation Apr 17 '12 at 00:34
  • 2
    The log-odds can be converted to a probability and you can compute a confidence interval on the probability (or the log-odds). But a prediction interval is on the response variable which is 0 or 1. If your outcome is survival with 0=dead and 1=alive, then you can predict the probability of being alive for a given set of covariates and compute a confidence interval on that probability. But the outcome is 0/1, you can't have a patient that is 62% alive it has to be 0 or 1, so the only possible prediction intervals are 0-0, 0-1, and 1-1 (which is why most people stick to confidence intervals). – Greg Snow Apr 17 '12 at 00:57
  • Hmm, logically I do see what you're saying. 100% of the outcome data must be 0 or 1. It's one thing to have a 95% confidence that the mean prediction is between 52-61%. It would not make sense, as you say, to assign a 95% probability to having an outcome between 25-88% (it can never be in that window!). I don't have enough points to upvote yet, but I did accept your answer. Anyway, perhaps I'm looking for something by a different name. – carbocation Apr 17 '12 at 02:39
  • 9
    If you have a situation where the response is binomial (which could be an aggregate of 0-1s under the same conditions), then a prediction interval may make sense. – Glen_b May 24 '12 at 05:49
  • 8
    Logistic regression is regression of a probability, trying to model the probability of some event as a function of regressor variables. Prediction intervals in this setting is taken as intervals on the probability scale, or the log-odds scale, so makes perfect senes. – kjetil b halvorsen Jun 24 '15 at 14:23
  • 2
    @kjetilbhalvorsen Could you elaborate on how one might actually calculate these? – dimitriy Aug 19 '16 at 17:56
  • 1
    What if we computed the prediction intervals for the underlying linear regression that models data in odds-log space, then projected them using the logit transformation? Wouldn't this make sense? Couldn't the prediction intervals for the underlying linear regression be computed as described in http://www.real-statistics.com/regression/confidence-and-prediction-intervals/ ... ? – Cesar Sep 14 '16 at 20:31
  • 3
    @Cesar, the prediction interval formula is derived by assuming that Y is normally distributed about the line, but in logistic regression we don't have a normal distribution, we have a Bernoulli or Binomial. Applying the formulas on that page would lead to either a confidence interval (can already do this) or an artificially widened confidence interval that does not meet the definition of a prediction interval (predicting actual outcomes on the original outcome scale). Like Glen_b mentioned, a prediction interval may make sense if the outcome is truly binomial. – Greg Snow Sep 16 '16 at 15:27
  • 2
    @GregSnow We are diverging not only from what prediction intervals are, but directly form our view of probability. If you are Bayesian, then all probabilities are beliefs about world state. Then, a logistic regression would tell you q=p(y=1|x) (observing x, what's our belief that y is 1; note that it is different from saying y can either be 1 or 0). Then, you can have p(q|x), that is, your belief about the above belief. – hyiltiz Apr 11 '18 at 21:52
  • @hyiltiz, If I were a Bayesian (while I am learning more Bayesian methods and looking for more opportunities to apply them, I still use frequentist methods and concepts, so I don't know where I fit) I would call what you describe a "posterior predictive distribution", not a "Prediction Interval", the P.I. phraseology implies frequentist statistics to me. Both are important and help understanding, but they are different concepts. – Greg Snow Apr 12 '18 at 15:20
  • @hyiltiz, Additionally, for the straight Bernoulli case, I don't see the posterior predictive giving any more insight beyond the posterior distribution. In the Binomial case it would. – Greg Snow Apr 12 '18 at 15:36