Does there exist theory behind how many knots one should use in a stan_gamm4() model for a given set of UNIQUE covariates?

Question

Currently, I am working with trying to fit a non-linear model with stan_gamm4. This can be done through specifying smoothing functons, such as:

s(...,k = -1). The following is some sample data that demonstrates issues I am unclear on:

library(mgcv)
library(rstan)
library(rstanarm)
data(trees)
> head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
> dim(trees)
[1] 31  3
> length(unique(trees$Girth))
[1] 27
> length(unique(trees$Height))
[1] 21
> length(unique(trees$Volume))
[1] 30

Now I will regress Volume on Girth and Height, but with k=22 on Height:

> out1 <- stan_gamm4(Volume ~ s(Height, k = 22) + s(Girth, k = -1), family = Gamma(link = log), data = trees, iter = 1000)
Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : 
  A term has fewer unique covariate combinations than specified maximum degrees of freedom

I understand that the number of unique covariates on Height must be less than or equal to 21. However, I am unsure how to determine the optimal knots to be used for each variable.

1) I had assumed that the s(Girth, k = -1) smoother on Girth with k=-1 will automatically set it using Generalized Cross Validation according to this post: Selecting knots for a GAM. However, it seems that with k=-1 the default is always k=10.

2) I have read in some posts that one should use one-tenth of the unique covariates. However, I am unsure whether there are any papers to back this up for an object like stan_gamm4.

Does anyone have any idea how I can determine what might be the best number of knots k to use? Is it the case that the closer we get to the unique values above the better? Thanks.

score 2 · Accepted Answer · answered Jun 26 '18 at 14:14

A Bayesian answer to the question of what k to specify (or what bs to specify or what predictors to include in the spline or how to specify the model more generally) are

Which specification is expected to predict future data best? Or what vector of weights on the predictions produced by the various models is expected to predict future data best?
Which specification is true, conditional on one of the specifications tried being true?

The first is what is implemented in the loo R package for estimated leave-one-out cross-validation with Pareto Smoothed Importance Sampling, which is what the loo method for an object produced by stan_gamm4 calls. Or you can call the loo_model_weights function. The second is what is implemented in the bridgesampling R package for posterior probabilities over models, which also has a method for an object produced by stan_gamm4 (although you have to specify the diagnostic_file argument to stan_gamm4 in order to write the necessary information to the disk).

However, the answer is often that "k does not matter very much" (due to the hyperpriors), which is the same as the frequentist answer (due to the penalization).

Does there exist theory behind how many knots one should use in a stan_gamm4() model for a given set of UNIQUE covariates?

1 Answers1