
My question deals with GAMs in the mgcv R package. Due to a small sample size I want to determine the prediction error using leave-one-out cross-validation (LOOCV). Is this reasonable? Is there a package or code with which I can do this? The errorest() function in the ipred package does not work. A simple test dataset is:

library(mgcv)
set.seed(0)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)   # simulated example data
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
summary(b)
pred <- predict(b, type = "response")

Thank you very much for your help!

Peter
  • You can have a look at the CVgam function in the gamclass package: http://www.inside-r.org/packages/cran/gamclass/docs/CVgam. I hope this can help. – user051514 Nov 06 '14 at 13:10
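
For reference, a minimal sketch of how CVgam might be used for LOOCV on the example data. The formula, data and nfold arguments are taken from the gamclass documentation (check ?CVgam before relying on them); setting nfold to the number of rows makes each fold a single observation, i.e. leave-one-out.

library(mgcv)
library(gamclass)
set.seed(0)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
# nfold = nrow(dat) turns k-fold cross-validation into leave-one-out CV
cv <- CVgam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat, nfold = nrow(dat))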

2 Answers


I really like the caret package for things like this, but unfortunately I just read that you can't specify the gam formula exactly for it.

"When you use train with this model, you cannot (at this time) specify the gam formula. caret has an internal function that figures out a formula based on how many unique levels each predictor has etc. In other words, train currently determines which terms are smoothed and which are plain old linear main effects."

source: https://stackoverflow.com/questions/20044014/error-with-train-from-caret-package-using-method-gam

But if you let train select the smooth terms, in this case it produces exactly your model anyway. The default performance metric here is RMSE, but you can change it using the summaryFunction argument of the trainControl function.
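
For example, here is a small sketch of my own (not taken from a particular caret version) that reports mean absolute error instead of RMSE. caret calls the summary function with a data frame containing obs and pred columns for the held-out observations.

# custom performance summary: returns a named vector of metrics
maeSummary <- function(data, lev = NULL, model = NULL) {
  c(MAE = mean(abs(data$obs - data$pred)))
}
ctrl <- trainControl(method = "LOOCV", summaryFunction = maeSummary)
# then call train(..., trControl = ctrl, metric = "MAE")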

I think one of the main drawbacks of LOOCV is that it takes forever when the dataset is large. Since your dataset is small and the fit runs quickly, I think it is a sensible option. (A hand-rolled mgcv-only LOOCV loop is sketched after the output below.)

Hope this helps.

library(mgcv)
library(caret)

set.seed(0)

dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

# LOOCV via caret; 'number' and 'repeats' are not needed for method = "LOOCV"
b <- train(y ~ x0 + x1 + x2 + x3,
           data = dat,
           method = "gam",
           trControl = trainControl(method = "LOOCV"),
           tuneGrid = data.frame(method = "GCV.Cp", select = FALSE))

print(b)
summary(b$finalModel)

output:

> print(b)
Generalized Additive Model using Splines 

400 samples
  9 predictors

No pre-processing
Resampling: Leave-One-Out Cross-Validation

Summary of sample sizes: 399, 399, 399, 399, 399, 399, ... 

Resampling results

  RMSE      Rsquared 
  2.157964  0.7091647

Tuning parameter 'select' was held constant at a value of FALSE

Tuning parameter 'method' was held constant at a value of GCV.Cp

> summary(b$finalModel)

Family: gaussian 
Link function: identity 

Formula:
.outcome ~ s(x0) + s(x1) + s(x2) + s(x3)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.9150     0.1049   75.44   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df       F  p-value    
s(x0) 5.173  6.287   4.564 0.000139 ***
s(x1) 2.357  2.927 103.089  < 2e-16 ***
s(x2) 8.517  8.931  84.308  < 2e-16 ***
s(x3) 1.000  1.000   0.441 0.506929    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.726   Deviance explained = 73.7%
GCV =  4.611  Scale est. = 4.4029    n = 400
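
For comparison, here is a hand-rolled LOOCV loop that stays entirely within mgcv (a sketch of my own, not part of the caret output above): refit the model n times, leaving out one observation each time, and predict that observation from the reduced fit.

library(mgcv)
set.seed(0)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)

loo_pred <- numeric(nrow(dat))
for (i in seq_len(nrow(dat))) {
  # fit on all rows except i, then predict the held-out row
  fit <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat[-i, ])
  loo_pred[i] <- predict(fit, newdata = dat[i, , drop = FALSE])
}
sqrt(mean((dat$y - loo_pred)^2))   # LOOCV root mean squared prediction error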
Jeff

In the mgcv package documentation it says:

"Given a model structure specified by a gam model formula, gam() attempts to find the appropriate smoothness for each applicable model term using prediction error criteria or likelihood based methods. The prediction error criteria used are Generalized (Approximate) Cross Validation (GCV or GACV) when the scale parameter is unknown or an Un-Biased Risk Estimator (UBRE) when it is known."

"gam in mgcv solves the smoothing parameter estimation problem by using the Generalized Cross Validation (GCV) criterion: nD/(n − DoF )2

or

an Un-Biased Risk Estimator (UBRE )criterion: D/n + 2sDoF/n − s"
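
As a quick numerical check (my own sketch, assuming b$gcv.ubre stores the minimised GCV score and b$edf the per-coefficient effective degrees of freedom), the GCV value reported by summary() can be reproduced from this formula on the question's example data:

library(mgcv)
set.seed(0)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)
b   <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)

n   <- nrow(dat)
D   <- b$deviance    # for a Gaussian GAM this is the residual sum of squares
DoF <- sum(b$edf)    # total effective degrees of freedom of the model
c(by_hand = n * D / (n - DoF)^2, reported = as.numeric(b$gcv.ubre))

The two numbers should agree closely.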

Brad