
I'm running across an interesting case where the Bayesian simulation approach outlined in Simon Wood's book (Generalized Additive Models (2004), section 4.8; as coded in the examples of help(predict.gam)) gives surprisingly large estimated uncertainty for logistic regression predictions where the point prediction is very near 0.

Here's example R code for a simplified problem. Note that the true probability of success is low over a large part of the domain, and this is where the uncertainty blows up for the Bayesian approach on the response scale but not on the linear predictor scale (and not for the se.fit values).

library(mgcv)    

x = (1:200)/200
# note that the empirical probabilities are 0 except 
# in the middle of the domain
y = c(rep(0, 100), 3, 7, 12, 15, 12, 20, 15, 9, 5, 8, 3, rep(0, 89))
z = cbind(y, 100 - y)

k <-  150
mod = gam(z ~ s(x, k = k), family = 'binomial')

rmvn <- function(n, mu, sig) { ## MVN random deviates (from the ?predict.gam examples)
  L <- mroot(sig)  # mgcv::mroot: matrix square root of the covariance
  m <- ncol(L)
  t(mu + L %*% matrix(rnorm(m * n), m, n))
}

n <- 500

# Bayesian uncertainty calculation, as per help(predict.gam)
Xp <- predict(mod, type="lpmatrix")
draws_coef <- rmvn(n, coef(mod), mod$Vp)       # posterior draws of the coefficients
draws_lp <- Xp %*% t(draws_coef)               # draws on the link scale (type = "link")
draws <- exp(draws_lp) / (1 + exp(draws_lp))   # inverse logit: draws on the response scale

# Bayesian standard deviation
sds = apply(draws,1,sd)

# plot to show Bayesian standard deviation is very large where
# predictions are near zero, unlike standard errors
par(mfrow = c(1,3))
plot(x, predict(mod, type = 'response'))
plot(x, sds)
# mgcv's estimated prediction standard error
sef = predict(mod, type = 'response', se.fit=TRUE)$se.fit
plot(x, sef)

# Bayesian standard deviation and standard error very similar
# for linear predictor
par(mfrow = c(1,2))
sds_lp <- apply(draws_lp, 1, sd)
plot(x, sds_lp)
plot(x, predict(mod, se.fit=TRUE)$se.fit)

Comparing to the output of predict.gam with se.fit=TRUE, the Bayesian standard deviation and the predict.gam standard error are very similar for the linear predictor (and those uncertainties are very large where the success probability is very near zero), but on the response scale the standard error is small exactly where the Bayesian standard deviation is large. Mathematically this seems to be a consequence of the fact that the se.fit calculation in predict.gam() appears to apply the delta method to the type="link" se.fit values, and this scales the type="response" values down a lot where the fitted probabilities are near zero.
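If I'm reading that right, the delta-method relationship is easy to check numerically: for the logit link the derivative of the inverse link is dmu/deta = p(1 - p), so the response-scale se.fit should equal the link-scale se.fit scaled by that factor. A quick check (the exact formula is my assumption, not documented mgcv behavior):

pr <- predict(mod, se.fit = TRUE)             # type = "link" is the default
p_hat <- plogis(pr$fit)                       # fitted probabilities
se_delta <- p_hat * (1 - p_hat) * pr$se.fit   # delta-method se on the response scale
max(abs(se_delta - sef))                      # sef from above; ~0 if the guess is right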

Does anyone have insights to help explain the large uncertainty? Presumably it reflects some sort of instability in parts of the domain where the linear predictor is very negative, but I'd like to understand it better and figure out whether there is a workaround, as I need posterior draws for downstream processing, not just pointwise standard errors.
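Edit: one diagnostic that seems consistent with the delta-method point above (a sketch; the quantile summary is my own choice, not anything prescribed by mgcv). Where the fitted probability is essentially zero, the link-scale posterior is wide and very negative, and the inverse logit maps it to a strongly right-skewed distribution on [0, 1]: the median is near 0 but occasional draws are not, which inflates the standard deviation. Quantile-based intervals are not dominated by that long right tail:

i <- 20    # a point deep in the near-zero region
quantile(draws[i, ], c(0.5, 0.9, 0.99, 1))
# 95% pointwise credible band from the posterior draws
ci <- apply(draws, 1, quantile, probs = c(0.025, 0.975))
par(mfrow = c(1, 1))
plot(x, predict(mod, type = 'response'), ylim = range(ci))
lines(x, ci[1, ], lty = 2)
lines(x, ci[2, ], lty = 2)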

  • Please include library calls in your code. I'm assuming `gam` comes from the `gam` library, but then I get an error since the function `s` does not have an argument k; did you mean df? – jaradniemi Mar 09 '16 at 16:51
  • Just include "library(mgcv)" before running the code. I'm using gam() from mgcv, not from the gam package. The s() function in mgcv does have a "k" argument. I'm using mgcv version 1.8-9. – Chris Paciorek Mar 10 '16 at 16:09
  • I don't believe this question is a duplicate, nor that the answer in the other post answers this question (at least not directly, though the answers may be related). The other question/answer regards a glm fit rather than the spline-based fit using mgcv::gam in this question. Furthermore, this question is about the quasi-Bayesian posterior simulation strategy suggested for use with mgcv::gam. – Chris Paciorek May 17 '19 at 20:16
