Purposely restricting k in mgcv's gam()

Question

The advice for choosing K is to set it as high as possible, while managing the trade-off with computation time (e.g. Choosing k in mgcv's gam()).

However, is it acceptable to restrict K to avoid overly complicated smooths that are likely to be biologically unrealistic? And would this impact on the model checking process (via randomised quantile residuals)?

For example, I am modelling the daily activity cycles of foxes, using data collected with camera-traps. Camera-traps just provide a snapshot of behaviour when the animal happens to walk in front of them (opposed to something like GPS collars where you get the full picture). A model with k = 10 produces more wiggliness than I believe to be realistic - I think this is more likely an artefact of the imperfect sampling process. On the other hand, a model with k = 5 looks more like what I'd expect, however gam.check() hints that k is set too low. See below:

I guess I am mainly concerned about (i) arbitrarily parametrising models to meet my expectations, (ii) justifying this in the manuscript and (iii) whether this will effect the model-checking process. Am I being sketchy or just overthinking this?

Additionally, I am restricting k for another term which should be specified as a linear term (the activity of one species declining with the activity of another), but this way all my covariates are subject to the same double penalty approach for model selection (as recommended by Gavin Simpson here GAM selection when both smooth and parametric terms are present). You can see how I am specifying this model (without the inclusion of other covariates) below:

model <- bam(fox ~ s(hour, bs = "cc", k = 5) + s(predicted_predator_activity, bs = "ts", k = 3), data = data, family = binomial, select = TRUE)

score 2 · Accepted Answer · answered May 06 '20 at 17:18

The penalised spline way of smoothing is a bit of a multi-faceted thing. The Bayesian view of smoothing — that you're implicitly using when fitting with REML or ML in mgcv —, would view the smoothness parameter(s) in the model as priors on the wiggliness of the functions.

From this viewpoint therefore, it is acceptable to constrain the size of the basis to meet the a priori expected wiggliness of the smooth effects you're estimating.

That said, with temporal (and spatial) variables such as the one we're discussing here, you do have to be somewhat careful not to violate the theory upon which the statistical tests you might be using (p values etc). A critical assumption is conditionally independent observations; once we've accounted for the model, observations are independent. This would be violated in the case of unmodelled temporal or spatial autocorrelation.

If you go with the more smooth effect of hour, you may be missing some temporal structure in the data.

I would personally go with the more smooth version and then plot the deviance residuals against hour and use a variogram or the autocorrelation function to look for unmodelled temporal structure. If there, you could use the rho argument to model it (assuming equally spaced observations).

A note on your model: I wouldn't combine "ts" bases with select = TRUE - that's a lot more penalties. You should be fine with the "tp" basis and select = TRUE. Also, you should be using knots = list(hour = c(0, 24)) in the model to set the end points for the "cc" cyclic smooth basis.

Purposely restricting k in mgcv's gam()

1 Answers1