11

When selecting an appropriate number of knots for a GAM one might want to take into account the number of data and increments on the x-axis.

What if we have 100 increments on the x-axis with 1000 data points at each increment.

The info here says:

If they are not supplied then the knots of the spline are placed evenly throughout the covariate values to which the term refers: For example, if fitting 101 data with an 11 knot spline of x then there would be a knot at every 10th (ordered) x value.

So a basic start should be 9 knots in this example? I am just not sure what range of knots would be suitable for this data set as it is possible to fit very small to very large numbers.

set.seed(1)
dat <- data.frame(y = rnorm(10000), x = 100)

library(ggplot)
ggplot(dat, aes(x = x, y = y)) + 
              geom_point(size= 0.5) +                      
stat_smooth(method = "gam", 
            formula = y ~ s(x, bs = "cs"),k=9, col = "black")

If k=25 provided a useful fit, would it be reasonable for this data?

user1320502
  • 837
  • 4
  • 11
  • 22

2 Answers2

17

Where is the idea coming from that GCV will automatically choose the number of knots? The number of knots (i.e., the basis dimension) is fixed and cannot be changed during model fit. What the GCV score in function gam() is doing "automatically" is not choosing the basis dimension k, as Ira S says, but is choosing the smooth level of each basis spline by introducing a wigliness penalty in the minimizer or fitting goal. To choose the number of knots k you should use a value larger than the number of degrees of freedom you are expecting. Quoting the help of choose.k: "exact choice of k is not generally critical: it should be chosen to be large enough that you are reasonably sure of having enough degrees of freedom to represent the underlying ‘truth’ reasonably well, but small enough to maintain reasonable computational efficiency". So, basically increase k in large steps until you see no changes in your plot, for instance. Summarizing: There is nothing like an "automatic" choice for k as Ira S is saying, the user should always choose a k value as part of model design. Otherwise you are most probably under-fitting your model!

nukimov
  • 562
  • 3
  • 13
  • just another clarification question. In the package mgcv vignette file, it says k is the dimension of the basis of variables that the smooth is a function of. When using bs = "cr", the cubic regression splines, I thought the dimension of the basis is 3. That is, k = 3 when bs = "cr", was I wrong? – vtshen Jul 01 '19 at 15:12
  • Almost right but not quite. What you say would be true if a spline would consist of a single 3rd degree polynomial, which is only a special case of a spline. A spline is a series of concatenated polynomials (normally more than one). The basis splines used to construct the smoothing splines consist of many polynomials joining on the knots, the more knots you have, the more degrees of freedom. That is why k is intrinsically related to the number of knots, wich is detailed described in Simon Woods book. – nukimov Jul 02 '19 at 19:41
11

A much better option is to fit your model using gam() in the mgcv package, which contains a method called Generalized Cross-validation (GCV). GCV will automatically choose the number of knots for your model so that simplicity is balanced against explanatory power. When using gam() in mgcv, turn GCV on by setting k to equal -1.

Just like this:

set.seed(1)
dat <- data.frame(y = rnorm(10000), x = rnorm(10000))

library(mgcv)
G1 <- gam(y ~ s(x, k = -1, bs = "cs"), data = dat)
summary(G1) # check the significance of your smooth term
gam.check(G1) # inspect your residuals to evaluate if the degree of smoothing is good

To plot your smooth line you will have to extract the model fit. This should do the trick:

plot(y~x, data = dat, cex = .1)
G1pred <- predict(G1)
I1 <- order(dat$y)
lines(dat$x, G1pred)

You can also adjust k manually, and see what number of k brings you closest to the k value set automatically by GCV.

HelloWorld
  • 175
  • 7
Ira S
  • 510
  • 6
  • 15
  • What does the `bs = "cs"` term in the spline do? – user321627 Jun 25 '18 at 00:46
  • 1
    "cs" specifies that the basis for smoothing shall be a cubic spline. – Manuel Bickel Jun 26 '18 at 19:59
  • Is not specifying `k` equivalent to specifying `k=-1`? – Nakx Jan 11 '19 at 10:58
  • Not sure I totally understand Nakx, but I'll clarify that k= -1 will allow the model to determine an optimal number of nodes using Generalized Cross Validation. Adjusting manually will allow comparison of how the model fit changes based on the number of nodes. That can be insightful, and may help depict the phenomenon of interest. – Ira S Jan 14 '19 at 17:10
  • +1 Great answer! How do you visualize how the spline turned out? I mean, the graph of the coefficients? – Erosennin Jan 28 '19 at 10:56
  • extracting the model fit and plotting it to the data is informative. Plotting the residuals may also be informative. that is as far as I've gone with it. – Ira S Jan 29 '19 at 17:37
  • @nukimov comment below is IMO correcting incorrect statements made in this answer. – Marta Karas Feb 02 '21 at 14:06
  • This answer is incorrect -- see @numikov 's answer below. `mgcv::s` states that the default value of k depends on the "on the number of variables that the smooth is a function of." `?mgcv::choose.k` provides details on empirically selecting k. – red Aug 02 '21 at 16:57