The number of predictors should reflect your understanding of the system you are modelling and the hypotheses you are testing.
Technically, you can have as many covariates in the model as you have observations minus 2 (one degree of freedom for the intercept and at least one left over so there is some hope of estimating anything). In a GAM, with 100 observations and 10 covariates, using, say, the default basis size in mgcv in R for each covariate, you would be fitting a model with 91 coefficients (10 covariates times 9 basis functions each, as k = 10 but only 9 basis functions per smooth get used due to identifiability constraints, plus one for the intercept).
Using method = 'REML' you could likely fit this model; the wiggliness penalties on the 10 smooth functions would more than likely regularise the fit enough for the model to be estimable, assuming none of the covariates is concurve with the other covariates (concurvity being the nonlinear analogue of collinearity). If you want to do selection of terms, then you'll need something like a second penalty on the null space of each basis (via select = TRUE in mgcv).
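As a rough sketch of the situation described above (simulated data and made-up covariate names x1 to x10, so purely illustrative), the mgcv call might look like this:

```r
library(mgcv)

## 100 observations on 10 covariates; only x1 and x2 actually affect the response
set.seed(1)
n  <- 100
df <- as.data.frame(matrix(runif(n * 10), ncol = 10,
                           dimnames = list(NULL, paste0("x", 1:10))))
df$y <- with(df, 2 * sin(pi * x1) + x2^2 + rnorm(n, sd = 0.3))

## a smooth of each covariate at the default basis size (k = 10)
form <- reformulate(paste0("s(", paste0("x", 1:10), ")"), response = "y")
m <- gam(form, data = df, method = "REML",
         select = TRUE)  # extra penalty on each smooth's null space

length(coef(m))  # 91: 10 smooths * 9 basis functions each + 1 intercept
```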
In practice, however, with so many covariates relative to the size of the data set, you are going to run into the problem of highly uncertain estimates of the smooth functions.
With a GAM, therefore, especially if you are limited by sample size, you need to think very carefully about the expected wiggliness of each of the smooth effects of the covariates you want to include in the model. Do you really need 9 degrees of freedom of wiggliness to represent every one of the smooth effects of the covariates? For many situations in GAMs, especially when we are not modelling temporal or spatial trends (or both), the effects of covariates tend to be relatively smooth. So perhaps you can get away with k = 6 to get 5 basis functions per smooth. That will give you some headroom with which to estimate the model. Then you can use several checks to see if the basis expansion should have used more basis functions for some of the covariates (say via gam.check() in mgcv).
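Continuing the illustrative fit above, a smaller basis plus the post-fit check might look like:

```r
## k = 6 gives 5 basis functions per smooth: 10 * 5 + 1 = 51 coefficients
form_small <- reformulate(paste0("s(", paste0("x", 1:10), ", k = 6)"), response = "y")
m_small <- gam(form_small, data = df, method = "REML")

## k-check: a smooth whose edf is close to k' and whose k-index is well below 1
## (with a small p-value) may need a larger basis
gam.check(m_small)
```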
In summary, you should always think about the terms you are including in the model and be able to justify their inclusion:
You don't want to be seduced by some unexpected but spuriously significant effect that arose because of the particular sample of data you collected or the variables you threw into the model.
Likewise, you wouldn't want to throw every variable at your disposal into the model, because of collinearity (or concurvity) problems.
So some thought has to precede model fitting.
In a GAM you also have to think about the size of the basis expansion used to represent the effect of each covariate on the response. You want the basis expansions to be rich enough to capture the true effect or a close approximation to it, but you don't want to needlessly waste degrees of freedom on extra basis functions that will just be penalised almost entirely out of the model.
So think about how complex (wiggly) each of the anticipated smooth effects should be and set the basis size accordingly.
Finally, if you really don't know which variables should be in the model, then you can apply the double-penalty approach of Marra & Wood (2012), which puts a second penalty on the null space of each basis (those basis functions that are perfectly smooth from the viewpoint of the wiggliness penalty) and can shrink terms out of the model entirely, in a similar way to the lasso penalty.
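To see the effect of that second penalty in the illustrative example above, you could compare the estimated degrees of freedom with and without select = TRUE; smooths of irrelevant covariates get shrunk towards an EDF of zero when the null-space penalty is present:

```r
m_single <- gam(form, data = df, method = "REML")                 # wiggliness penalties only
m_double <- gam(form, data = df, method = "REML", select = TRUE)  # plus null-space penalties

## EDFs near zero under the double penalty indicate terms shrunk out of the model
round(cbind(single_penalty = summary(m_single)$edf,
            double_penalty = summary(m_double)$edf), 2)
```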