Getting better p-values for a gam model with multiple explanatory terms using bootstrapping

Question

I'm fitting a GAMM in R to count data with the gamm4 package, using a Poisson distribution, a random effect, and several explanatory variables: continuous variables with smoothers, plus factors. After fitting the model, several explanatory variables are significant, with p-values around 0.02 to 0.03. However, Zuur et al. (2009) and Wood (2006) suggest that such p-values cannot be trusted: "...based on limited simulation experience, the p-values close to 0.05 can be around half of their correct value when the null hypothesis is true." They suggest bootstrapping to get better p-values.

I've found several posts demonstrating how to do this with a single explanatory variable, but nothing satisfactory demonstrating how to bootstrap to get better p-values when a model has multiple explanatory variables.

A subset of my data looks like this:

Response variable (y) is a count of individuals: 196 151 195 134 129 144  48  97 234 115 146 210 305  42 113  76  25 131 140 283 103 120 193 196 311 125 323 248  88  70 183 170 164 121 154 243 312 181  98 200 217 199 168 144  69 290 264 127 137 190  84 267 213  58 228  78 123  71  72  99 109  36 103 180  56  52  57 210 115

Explanatory variable 1 (x1) is a measure of metal #1 concentrations (continuous data): 2.040546 1.918037 1.955820 1.939079 1.954283 1.948943 2.100827 1.888521 2.058084 2.390421 2.067924 1.916467 2.090603 2.130918 2.195256 2.067476 2.106653 1.904094 2.148486 1.830157 2.050592 1.966457 1.949843 1.998073 2.074533 1.943336 1.891463 1.840778 2.101109 2.117568 1.943524 1.967779 1.830191 2.062435 2.006324 1.902633 2.042048 2.098815 2.065316 2.134942 2.006954 2.286689 2.107825 2.044101 2.073800 2.185401 1.856314 2.103332 2.161628 1.842235 2.187495 2.028387 1.763994 2.048215 1.992492 2.114086 1.986935 2.331255 2.198746 2.067275 2.130385 2.232669 2.210650 1.979862 2.105747 2.125604 2.042671 2.109569 2.032746

Explanatory variable 2 (x2) is a measure of metal #2 concentrations (continuous data): 3.2978  3.1817  4.4691  6.2636  6.6527  3.6359  6.6887  4.0395  7.1494  8.9839  5.6024  2.7550  5.2336  4.2051  7.0592 4.5547  8.1309  3.0459  5.8556  4.0165  3.6044  5.9570  1.8680  1.4859  3.6910  1.2615  3.4319  2.5698  2.3036  6.0554 4.1815  5.3875  4.0644 10.2285  2.9753  1.8285  5.7248  5.4903  5.7735  3.5509  2.3647  5.6811  4.0478  1.8132  9.0615 4.4908  1.6612  5.4674  7.2074  2.6929  6.5781  4.2749  1.1909  3.6279  2.7929  4.1203  2.6168  2.5238  2.2740  6.4021 5.3661  3.5428  3.9439  2.4381  4.5586  4.3546  2.5190  8.0448  2.9777

Explanatory variable 3 (x3) is a factor: 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0

Random factor (site): 1  2  3  3  3  4  4  5  5  6  6  7  7  8  8  8  9  10 10 11 11 11 12 12 13 13 13 14 14 14 15 15 16 16 17 17 17 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 25 26 26 27 27 28 28 30 30 31 32 33 34

Using this post and this post, I came up with:

library(gamm4)
df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = as.factor(x3), site = as.factor(site))
n <- nrow(df)
nboot <- 1000

for (i in 1:nboot) {
  # Resample rows with replacement
  k.star <- sample(n, replace = TRUE)
  newdata <- df[k.star, , drop = FALSE]
  boot.mod <- gamm4(y ~ s(x1) + s(x2) + x3, random = ~(1|site), data = newdata, family = poisson)
  # gamm4 returns a list; predict() has to be called on the $gam component
  predterms <- predict(boot.mod$gam, newdata, type = "terms")
  # Note: pred is overwritten on every iteration
  pred <- attr(predterms, "constant") + rowSums(predterms)
}

That is as far as I got. The post goes on to say, "You could then plot each column of pred against the relevant column from newdata to give the bootstrap smooth for each term. I would do that with lines() to add them to the plot." But I'm not sure how to do this, or how doing it would give me either (1) confidence intervals for s(x1) and s(x2), so I can see whether those intervals overlap zero and judge significance that way, or (2) p-values for s(x1) and s(x2).

Thank you in advance for any advice/guidance.

score 1 · Answer 1 · answered Jun 18 '21 at 18:51

Although Wood (2006) said that the p-values shouldn't be trusted, Wood (2013) said that they should be fine. See help("summary.gam") for the references and further explanation. The computation of the p-values does not take into account uncertainty in the smoothing parameters, but according to the documentation, this should not be a problem unless that uncertainty is too high.

Furthermore, Wood in the GAM book section 6.10.3 says the following about the bootstrap:

An alternative might be to use bootstrapping to quantify uncertainty, but this is typically much more expensive, requiring a model fit per bootstrap replicate. In addition bootstrapping is somewhat problematic when there are penalties present. Non-parametric bootstrapping results in some data appearing twice in the bootstrap sample, which leads to under-smoothing (consider leave-one-out cross validation to see why this will happen). Parametric bootstrapping is made awkward by the presence of smoothing bias in the parametric model from which the samples are generated. If the expense of bootstrapping is tolerable, then a better approach to more complete uncertainty quantification is to take a full Bayesian approach, put priors on the smoothing parameters, and simulate directly from the posterior of the model coefficients.
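The coefficient-simulation idea at the end of the quoted passage can be sketched with mgcv. What follows is a simplified relative of it, not the full Bayesian approach Wood describes: the smoothing parameters are held at their estimates, and coefficients are drawn from the approximate posterior N(beta_hat, Vp). The data are simulated, the model has no random effect, and the names dat, grid, beta.sim, s1.sim and s1.ci are illustrative assumptions rather than anything from the question.

```r
library(mgcv)

set.seed(1)
# Simulated stand-in data (hypothetical; not the data from the question)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- rpois(n, exp(1 + sin(2 * pi * dat$x1)))  # x2 has no true effect

fit <- gam(y ~ s(x1) + s(x2), family = poisson, data = dat, method = "REML")

# Prediction grid over x1, with x2 held at a fixed value
grid <- data.frame(x1 = seq(0, 1, length.out = 100), x2 = 0.5)
Xp <- predict(fit, grid, type = "lpmatrix")

# Draw coefficient vectors from the approximate posterior N(beta_hat, Vp),
# conditional on the estimated smoothing parameters
nsim <- 1000
beta.sim <- rmvn(nsim, coef(fit), vcov(fit))

# Use only the columns belonging to s(x1) to get draws of that smooth
idx <- grep("s(x1)", colnames(Xp), fixed = TRUE)
s1.sim <- Xp[, idx] %*% t(beta.sim[, idx])

# Pointwise 95% interval for s(x1) along the grid (rows: lower, upper)
s1.ci <- apply(s1.sim, 1, quantile, probs = c(0.025, 0.975))
```

Repeating the last three steps with the s(x2) columns gives the corresponding interval for the second smooth.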

score 0 · Answer 2 · edited Sep 26 '17 at 12:41

I think this is what you need.

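library(boot)
library(gamm4)

# Statistic for boot(): refit the model on the resampled rows and
# return the p-values of the smooth terms.
reg.boot <- function(YourData, i) {
  DataForBoot <- YourData[i, ]
  boot.mod <- gamm4(y ~ s(x1) + s(x2) + x3, random = ~(1|site), data = DataForBoot, family = poisson)
  # gamm4 returns a list; summary() has to be called on the $gam component
  mod.summary <- summary(boot.mod$gam)
  Table_coefs <- mod.summary$s.table
  Table_coefs[, 4]  # p-value column, one entry per smooth term
}

# df is the data frame built in the question
bootSsp <- boot(df, reg.boot, R = 1000)
# This model has two smooth terms: index 1 is s(x1), index 2 is s(x2)
ci1 <- boot.ci(bootSsp, type = "perc", index = 1)
ci2 <- boot.ci(bootSsp, type = "perc", index = 2)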

answered Sep 26 '17 at 10:42

José

An attempted editor comments, "Thank you for this answer! I couldn't get it to work as is, but with a little tinkering I realized that I just needed to add `$gam` to the summary command. This seems to be something particular with gamm4 models." – gung - Reinstate Monica Oct 10 '17 at 20:19
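The point about `$gam` in the comment above can be checked on a small example. This is a minimal sketch on simulated data, assuming a Poisson model with two smooths and a random intercept; dat, site.eff, s.tab and p.vals are hypothetical names, not anything from the question or the answer.

```r
library(gamm4)

set.seed(1)
# Simulated stand-in data (hypothetical; not the data from the question)
n <- 120
site <- sample(1:20, n, replace = TRUE)
site.eff <- rnorm(20, sd = 0.3)          # one random intercept per site
dat <- data.frame(x1 = runif(n), x2 = runif(n), site = factor(site))
dat$y <- rpois(n, exp(1 + sin(2 * pi * dat$x1) + site.eff[site]))

fit <- gamm4(y ~ s(x1) + s(x2), random = ~(1 | site), data = dat, family = poisson)

# The fit is a list holding a mixed-model part ($mer) and a GAM part ($gam)
names(fit)

# The table of smooth terms, including p-values, comes from the GAM part
s.tab <- summary(fit$gam)$s.table
p.vals <- s.tab[, 4]                     # fourth column holds the p-values
```

Calling summary() or predict() on fit itself does not give the GAM output, because fit is a plain list rather than a fitted model object.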
