The following question builds on the discussion found on this page.
Given a response variable y
, a continuous explanatory variable x
and a factor fac
, it is possible to define a General Additive Model (GAM) with an interaction between x
and fac
using the argument by=
. According to the help file ?gam.models
in the R package mgcv
, this can be accomplished as follows:
gam1 <- gam(y ~ fac +s(x, by = fac), ...)
@GavinSimpson here suggests a different approach:
gam2 <- gam(y ~ fac +s(x) +s(x, by = fac, m=1), ...)
I have been playing around with a third model:
gam3 <- gam(y ~ s(x, by = fac), ...)
My main questions are: are some of these models just wrong, or are they simply different? In the latter case, what are their differences? Based on the example that I am going to discuss below I think I could understand some of their differences, but I am still missing something.
As an example I am going to use a dataset with color spectra for flowers of two different plant species measured at different locations.
rm(list=ls())
# install.packages("RCurl")
library(RCurl) # allows accessing data from URL
df <- read.delim(text=getURL("https://raw.githubusercontent.com/marcoplebani85/datasets/master/flower_color_spectra.txt"))
library(mgcv)
These are the mean color spectra at the locality level for the two species (rolling means were used):
Each color refers to a different species. Each line refers to a different locality.
My final goal is to model the (potentially interactive) effect of Taxon
and wavelength wl
on % reflectance (referred to as density
in the code and dataset) while accounting for Locality
as a random effect in a mixed-effect GAM. For the moment I won't add the mixed effect part to my plate, which is already full enough with trying to understand how to model interactions.
I'll start with the simplest of the three interactive GAMs:
gam.interaction0 <- gam(density ~ s(wl, by = Taxon), data = df)
# common intercept, different slopes
plot(gam.interaction0, pages=1)
summary(gam.interaction0)
Produces:
Family: gaussian
Link function: identity
Formula:
density ~ s(wl, by = Taxon)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.3490 0.1693 167.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(wl):TaxonSpeciesA 8.938 8.999 884.3 <2e-16 ***
s(wl):TaxonSpeciesB 8.838 8.992 325.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.523 Deviance explained = 52.4%
GCV = 284.96 Scale est. = 284.42 n = 9918
The parametric part is the same for both species, but different splines are fitted for each species. It is a bit confusing to have a parametric part in the summary of GAMs, which are non-parametric. @IsabellaGhement explains:
If you look at the plots of the estimated smooth effects (smooths) corresponding to your first model, you will notice that they are centered about zero. So you need to 'shift' those smooths up (if the estimated intercept is positive) or down (if the estimated intercept is negative) to obtain the smooth functions you thought you were estimating. In other words, you need to add the estimated intercept to the smooths to get at what you really want. For your first model, the 'shift' is assumed to be the same for both smooths.
Moving on:
gam.interaction1 <- gam(density ~ Taxon +s(wl, by = Taxon, m=1), data = df)
plot(gam.interaction1,pages=1)
summary(gam.interaction1)
Gives:
Family: gaussian
Link function: identity
Formula:
density ~ Taxon + s(wl, by = Taxon, m = 1)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.3132 0.1482 272.0 <2e-16 ***
TaxonSpeciesB -26.0221 0.2186 -119.1 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(wl):TaxonSpeciesA 7.978 8 2390 <2e-16 ***
s(wl):TaxonSpeciesB 7.965 8 879 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.803 Deviance explained = 80.3%
GCV = 117.89 Scale est. = 117.68 n = 9918
Now, each species also have its own parametric estimate.
The next model is the one that I have trouble understanding:
gam.interaction2 <- gam(density ~ Taxon + s(wl) + s(wl, by = Taxon, m=1), data = df)
plot(gam.interaction2, pages=1)
I have no clear idea of what these graphs represent.
summary(gam.interaction2)
Gives:
Family: gaussian
Link function: identity
Formula:
density ~ Taxon + s(wl) + s(wl, by = Taxon, m = 1)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.3132 0.1463 275.6 <2e-16 ***
TaxonSpeciesB -26.0221 0.2157 -120.6 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(wl) 8.940 8.994 30.06 <2e-16 ***
s(wl):TaxonSpeciesA 8.001 8.000 11.61 <2e-16 ***
s(wl):TaxonSpeciesB 8.001 8.000 19.59 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.808 Deviance explained = 80.8%
GCV = 114.96 Scale est. = 114.65 n = 9918
The parametric part of gam.interaction2
is about the same as for gam.interaction1
, but now there are three estimates for smooth terms, which I cannot interpret.
Thanks in advance to anyone who will take the time to help me understand the differences in the three models.