I'm running a GAMM in R on count data with multiple explanatory variables, including both continuous variables with smoothers and factors, a random effect, and a Poisson distribution, using the gamm4 package. After running the model, I have several explanatory variables that are significant with p-values around 0.02 to 0.03, but Zuur et al. (2009) and Wood (2006) suggest that these p-values cannot be trusted: "...based on limited simulation experience, the p-values close to 0.05 can be around half of their correct value when the null hypothesis is true." They suggest bootstrapping to get better p-values.
I've found several posts demonstrating how to do this with a single explanatory variable, but nothing satisfactory demonstrating how to bootstrap to get better p-values when a model has multiple explanatory variables.
A subset of my data looks like this:
Response variable (y) is a count of individuals: 196 151 195 134 129 144 48 97 234 115 146 210 305 42 113 76 25 131 140 283 103 120 193 196 311 125 323 248 88 70 183 170 164 121 154 243 312 181 98 200 217 199 168 144 69 290 264 127 137 190 84 267 213 58 228 78 123 71 72 99 109 36 103 180 56 52 57 210 115
Explanatory variable 1 (x1) is a measure of metal #1 concentrations (continuous data): 2.040546 1.918037 1.955820 1.939079 1.954283 1.948943 2.100827 1.888521 2.058084 2.390421 2.067924 1.916467 2.090603 2.130918 2.195256 2.067476 2.106653 1.904094 2.148486 1.830157 2.050592 1.966457 1.949843 1.998073 2.074533 1.943336 1.891463 1.840778 2.101109 2.117568 1.943524 1.967779 1.830191 2.062435 2.006324 1.902633 2.042048 2.098815 2.065316 2.134942 2.006954 2.286689 2.107825 2.044101 2.073800 2.185401 1.856314 2.103332 2.161628 1.842235 2.187495 2.028387 1.763994 2.048215 1.992492 2.114086 1.986935 2.331255 2.198746 2.067275 2.130385 2.232669 2.210650 1.979862 2.105747 2.125604 2.042671 2.109569 2.032746
Explanatory variable 2 (x2) is a measure of metal #2 concentrations (continuous data): 3.2978 3.1817 4.4691 6.2636 6.6527 3.6359 6.6887 4.0395 7.1494 8.9839 5.6024 2.7550 5.2336 4.2051 7.0592 4.5547 8.1309 3.0459 5.8556 4.0165 3.6044 5.9570 1.8680 1.4859 3.6910 1.2615 3.4319 2.5698 2.3036 6.0554 4.1815 5.3875 4.0644 10.2285 2.9753 1.8285 5.7248 5.4903 5.7735 3.5509 2.3647 5.6811 4.0478 1.8132 9.0615 4.4908 1.6612 5.4674 7.2074 2.6929 6.5781 4.2749 1.1909 3.6279 2.7929 4.1203 2.6168 2.5238 2.2740 6.4021 5.3661 3.5428 3.9439 2.4381 4.5586 4.3546 2.5190 8.0448 2.9777
Explanatory variable 3 (x3) is a factor: 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0
Random factor (site): 1 2 3 3 3 4 4 5 5 6 6 7 7 8 8 8 9 10 10 11 11 11 12 12 13 13 13 14 14 14 15 15 16 16 17 17 17 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 25 26 26 27 27 28 28 30 30 31 32 33 34
Using this post and this post, I came up with:
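library(gamm4)
df <- data.frame(y = y, x1 = x1, x2 = x2, x3 = as.factor(x3), site = as.factor(site))
n <- nrow(df)
nboot <- 1000
# Keep every replicate; a plain assignment inside the loop would overwrite the previous one
predterms <- vector("list", nboot)
pred <- vector("list", nboot)
for (i in seq_len(nboot)) {
  k.star <- sample(n, replace = TRUE)
  newdata <- df[k.star, , drop = FALSE]
  boot.mod <- gamm4(y ~ s(x1) + s(x2) + x3, random = ~(1 | site), data = newdata, family = poisson)
  # gamm4() returns a list with $mer and $gam; predict() has to be called on the $gam part
  predterms[[i]] <- predict(boot.mod$gam, newdata, type = "terms")
  # Linear predictor: intercept plus the sum of all term contributions
  pred[[i]] <- attr(predterms[[i]], "constant") + rowSums(predterms[[i]])
}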
That is as far as I got. The post goes on to say "You could then plot each column of pred against the relevant column from newdata to give the bootstrap smooth for each term. I would do that with lines() to add them to the plot." But I'm not sure how to do this, or how doing this will get me either (1) confidence intervals for s(x1) and s(x2), so I can see whether those intervals overlap zero and determine significance that way, or (2) p-values for s(x1) and s(x2).
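For concreteness, here is a minimal sketch of what the quoted suggestion might look like for s(x1). It is only an illustration under several assumptions: the data frame dat is simulated stand-in data with the same column names as my df (not my real data), the names grid, curves and band are made up for this example, nboot is kept small for speed, and rows are resampled without respecting the site grouping, exactly as in my loop above. The resulting band is a pointwise percentile interval for the smooth, not a p-value.

```r
library(gamm4)

# Simulated stand-in data (an assumption for illustration); swap in the real df
set.seed(1)
n <- 69
dat <- data.frame(
  x1   = runif(n, 1.7, 2.4),
  x2   = runif(n, 1, 10),
  x3   = factor(rbinom(n, 1, 0.2)),
  site = factor(sample(1:30, n, replace = TRUE))
)
dat$y <- rpois(n, exp(5 - 0.1 * dat$x2))

# Fixed grid of x1 values so every bootstrap curve is evaluated at the same points.
# x2 and x3 are held at arbitrary values; only the s(x1) column is used below.
grid <- data.frame(
  x1 = seq(min(dat$x1), max(dat$x1), length.out = 100),
  x2 = mean(dat$x2),
  x3 = factor(levels(dat$x3)[1], levels = levels(dat$x3))
)

nboot <- 50  # small for speed; a real run would use many more replicates
curves <- matrix(NA_real_, nrow = nrow(grid), ncol = nboot)

for (i in seq_len(nboot)) {
  k.star <- sample(nrow(dat), replace = TRUE)
  fit <- try(
    gamm4(y ~ s(x1) + s(x2) + x3, random = ~(1 | site),
          data = dat[k.star, , drop = FALSE], family = poisson),
    silent = TRUE
  )
  if (inherits(fit, "try-error")) next  # leave the column as NA if a fit fails
  # One column per model term; keep the centred contribution of s(x1)
  curves[, i] <- predict(fit$gam, newdata = grid, type = "terms")[, "s(x1)"]
}

# Pointwise 95% percentile band across the bootstrap curves
band <- apply(curves, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)

# Each bootstrap smooth as a grey line, with the band and a zero line on top
matplot(grid$x1, curves, type = "l", lty = 1, col = "grey80",
        xlab = "x1", ylab = "s(x1)")
lines(grid$x1, band[1, ], lwd = 2)
lines(grid$x1, band[2, ], lwd = 2)
abline(h = 0, lty = 2)
```

The same loop could store the "s(x2)" column against a grid over x2. Whether a band that excludes zero over part of the range can stand in for a significance test is part of what I am unsure about.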
Thank you in advance for any advice/guidance.