
The basis dimension (k) in GAMs allows for great flexibility in curve fitting.

In my application having a large enough k is necessary to ensure GAM fits are monotonic.

k = 20 encompasses a larger function space than does k = 10, for example.

The term "basis dimension" could easily be confused with the concepts of "basis" and "dimension" that are central to linear algebra.

I have not been able to find a good "layman's explanation" of k in Simon Wood's excellent text on fitting GAMs in R using the mgcv package.

Can anyone shed some light?

Thanks.

compbiostats
  • If you require monotonicity there are bases that will give you that at any $k$. – Glen_b Sep 09 '17 at 03:17
  • @Glen_b My thought is to compare cubic and P-spline GAM bases... I start with k = 20 and use gam.check()... if there are patterns in the data, I will then double k and re-fit. I am also going to use Gaussian processes/Kriging and shape-constrained additive models (SCAMs), the latter of which impose monotonicity and concavity constraints when doing the fitting. – compbiostats Sep 09 '17 at 04:01
  • I'll repeat Glen_b's comment; "In my application having a large enough k is necessary to ensure GAM fits are monotonic." is odd; the dimensionality of the basis has nothing to do with monotonicity. If you use a regular basis there is nothing there to constrain the resultant function to be monotonic, no matter how many basis functions you use. Likewise, you can enforce monotonicity in a basis with suitable constraints at any value of $k$. In this context, this is an un-needed distraction when considering what a basis is and what its dimension is. – Gavin Simpson Sep 11 '17 at 16:24

1 Answer


$k$ is the dimensionality of the spline basis expansion of one or possibly more covariates. By default (with the thin plate spline basis and a spline for a single covariate) the basis will contain $k - 1$ basis functions, because one is lost to the identifiability constraint on the smooth. These functions span a function space. (This space is a vector space.)

A basis is what results from a basis expansion. In other words, a basis is the set of functions, and the individual functions might be called basis functions.

The dimension of the basis is typically written as $n$, the number of basis functions in the basis, but in mgcv typically this is written as $k$.

A layperson's explanation of $k$ is that it sets the maximum possible degrees of freedom allowed for a single smooth term in the model. That maximum will invariably not be exactly $k$, however: typically it is $k - 1$, due to the identifiability constraint on the smooth term.
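As a rough illustration of the $k - 1$ point, here is a minimal sketch using mgcv on simulated data. The data, the object names, and the choice of `method = "REML"` are my own assumptions for the example, not anything from the question.

```r
# Minimal sketch on simulated (made-up) data; requires the mgcv package.
library(mgcv)

set.seed(42)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)

# Same model, two basis dimensions
m10 <- gam(y ~ s(x, k = 10), data = dat, method = "REML")
m20 <- gam(y ~ s(x, k = 20), data = dat, method = "REML")

# Count the coefficients belonging to the smooth (named "s(x).1", "s(x).2", ...)
n_coef_10 <- sum(grepl("s(x)", names(coef(m10)), fixed = TRUE))
n_coef_20 <- sum(grepl("s(x)", names(coef(m20)), fixed = TRUE))
n_coef_10  # 9, i.e. k - 1
n_coef_20  # 19, i.e. k - 1

# Effective degrees of freedom actually used by each smooth after
# penalization; these cannot exceed k - 1
edf_10 <- summary(m10)$edf
edf_20 <- summary(m20)$edf
```

The effective degrees of freedom reported by `summary()` will usually be well below the upper limit, because the wiggliness penalty shrinks the fit; $k$ only sets the ceiling.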

The terms you refer to from linear algebra are the same as the terms basis and dimension for splines in GAMs.

A vector in a vector space is a linear combination of the basis vectors of that space (together, the basis):

$$\mathbf{v} = a_1\mathbf{b}_{i_1} + a_2\mathbf{b}_{i_2} + \cdots + a_n\mathbf{b}_{i_n}$$

where the $a$ are scalar weights specifying how much each basis vector contributes to the vector $\mathbf{v}$. The $\mathbf{b}$ are vectors contained in the vector space. The basis is the collection of vectors $\mathbf{b}$, and the dimension of the basis is $n$, the number of vectors in the basis.

In a spline basis we have the same thing. We might think of it more as a linear combination of basis functions, but it is conceptually the same:

$$\mathbf{f} = a_1\mathbf{b}_{i_1} + a_2\mathbf{b}_{i_2} + \cdots + a_n\mathbf{b}_{i_n}$$

where now each $\mathbf{b}$ contains the values of one basis function evaluated at the values of the covariate. At a given value $x_i$ of the covariate this is

$$f(x_i) = a_1b_1(x_i) + a_2b_2(x_i) + \cdots + a_nb_n(x_i)$$

(I may be butchering notation here, but...) hence we are evaluating the function $f$, the spline, at the $i$th value of the covariate $x$. The value taken by the function is a linear combination of the values of the $n$ basis functions, each evaluated at the $i$th value of the covariate. The scalars $a$ are the coefficients estimated when fitting the GAM.
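To make the linear combination concrete, the following sketch (again on simulated data, with hypothetical object names) pulls the evaluated basis functions out of a fitted mgcv model and rebuilds the smooth by hand as coefficients times basis-function values.

```r
# Minimal sketch on simulated (made-up) data; requires the mgcv package.
library(mgcv)

set.seed(42)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)

m <- gam(y ~ s(x, k = 10), data = dat, method = "REML")

# Each column of X holds one model term evaluated at every x_i:
# an intercept column plus the k - 1 basis functions of s(x)
X <- predict(m, type = "lpmatrix")
is_smooth <- grepl("s(x)", colnames(X), fixed = TRUE)

# f(x_i) = a_1 b_1(x_i) + ... + a_n b_n(x_i), for all i at once
f_manual <- as.numeric(X[, is_smooth] %*% coef(m)[is_smooth])

# The same smooth as computed by mgcv itself (only one term in this model)
f_terms <- as.numeric(predict(m, type = "terms")[, 1])

all.equal(f_manual, f_terms)  # TRUE
```

Plotting the columns of `X[, is_smooth]` against `dat$x` (for example with `matplot()`) shows the individual basis functions; the fitted smooth is just their weighted sum.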

Gavin Simpson
  • This is a nice explanation! Though simply expressing k as the maximum allowable degrees of freedom for smooth terms should be a sufficient explanation for biologists! Thanks! – compbiostats Sep 11 '17 at 16:24