30

The following code evaluates the similarity between two time series:

set.seed(10)
RandData <- rnorm(8760*2)
America <- rep(c('NewYork','Miami'),each=8760)

Date = seq(from=as.POSIXct("1991-01-01 00:00"), 
           to=as.POSIXct("1991-12-31 23:00"), length=8760)

DatNew <- data.frame(Loc = America,
                     Doy = as.numeric(format(Date,format = "%j")),
                     Tod = as.numeric(format(Date,format = "%H")),
                     Temp = RandData,
                     DecTime = rep(seq(1, length(RandData)/2) / (length(RandData)/2),
                                   2))
require(mgcv)
mod1 <- gam(Temp ~ Loc + s(Doy) + s(Doy,by = Loc) +
  s(Tod) + s(Tod,by = Loc),data = DatNew, method = "ML")

Here, gam is used to evaluate how the temperature at New York and Miami vary from the mean temperature (of both locations) at different times of the day. The problem that I now have is that I need to include an interaction term which shows how the temperature of each location varies throughout at the day for different days of the year. I eventually hope to display all of this information on one graph (for each location). So, for Miami I hope to have one graph that shows how the temperature varies from the mean during different times of the day and different times of the year (3d plot?)

chl
  • 50,972
  • 18
  • 205
  • 364
KatyB
  • 909
  • 2
  • 12
  • 17
  • 3
    You might find the answer to this question http://stats.stackexchange.com/questions/18937/adjusting-for-tilt-of-the-earth/19078#19078 relevant. – jbowman Jul 21 '12 at 14:42

2 Answers2

34

For two continuous variables then you can do what you want (whether this is an interaction or not I'll leave others to discuss as per comments to @Greg's Answer) using:

mod1 <- gam(Temp ~ Loc + s(Doy, bs = "cc", k = 5) + 
                         s(Doy, bs = "cc", by = Loc, k = 5, m = 1) + 
                         s(Tod, bs = "cc", k = 5) + 
                         s(Tod, bs = "cc", by = Loc, k = 5, m = 1) +
                         te(Tod, Doy, by = Loc, bs = rep("cc",2)),
            data = DatNew, method = "ML")

The simpler model should then be nested within the more complex model above. That simpler model is:

mod0 <- gam(Temp ~ Loc + s(Doy, bs = "cc", k = 5) + 
                         s(Doy, bs = "cc", by = Loc, k = 5, m = 1) + 
                         s(Tod, bs = "cc", k = 5) + 
                         s(Tod, bs = "cc", by = Loc, k = 5, m = 1),
            data = DatNew, method = "ML")

Note two things here:

  1. The basis-type for each smoother is stated. In this case we would expect that there are no discontinuities in Temp between 23:59 and 00:00 for Tod nor between Doy == 1 and Doy == 365.25. Hence cyclic cubic splines are appropriate, indicated here via bs = "cc".
  2. The basis dimension is stated explicitly (k = 5). This matches the default basis dimension for each smooth in a te() term.

Together these features ensure that the simpler model really is nested within the more complex model.

For more see ?gam.models in mgcv.

Gavin Simpson
  • 37,567
  • 5
  • 110
  • 153
  • Related to your 2nd point - in addition to the specification of `k`, should one also fix the number of knots (e.g. `fx=TRUE`). If not, the resulting model shows varying `edf` for each term. – Marc in the box Aug 03 '15 at 14:55
  • I should update this answer a little given some new functionality in the **mgcv** package in terms of splines for marginal bases. That said, I don't agree that you need to fix the degrees of freedom for the spline. The key is to ensure that the bases for the models are appropriately nested. Then differences between models are possible by setting some of the coefficients for the basis functions to zero, just as would happen in a linear model with non-spline terms. – Gavin Simpson Aug 03 '15 at 15:02
  • Great - thanks for this clarification and the nice answer. – Marc in the box Aug 03 '15 at 18:00
  • 3
    Hoping someone are still watching this thread and may answer. In these models, how come you must specify both `s(Doy...)` and `s(Doy, by =Loc...)`? I thought the first one would be nested into the latter and thus be unneccessary to specify? – ego_ Jan 25 '16 at 13:48
  • 4
    No, the first smooth is the global function and the by smooth represent site-specific differences between it and the global smooth. The by smooths really need `m = 1` added to them to put the penalty on the first derivative for the difference smooths though. – Gavin Simpson Jan 25 '16 at 14:05
  • Thanks, and tanks for also elaborating on the `m` argument. – ego_ Jan 26 '16 at 12:44
  • @GavinSimpson is the same point about needing to include both `s(Doy...)` and `s(DOY, by = Loc...)` the case if `ts()` is used instead of `s()`? – Joshua Rosenberg Jan 10 '17 at 21:19
  • 2
    @JoshuaRosenberg If you mean `te()`, it depends what you are including in the tensor product? The interactions described here are factor-smooth interactions, but `te()` would imply two or more continuous variables. If you want global terms and subject-specific deviations, then yes, `te(DoY, Year, by = Loc, m = 1)` could be use alongside `te(DoY, Year)`, although there are other ways to achieve similar things using random effect-like factor-smooth interaction and `te()` terms containing a random effect spline. – Gavin Simpson Jan 11 '17 at 02:05
  • @GavinSimpson I have just asked a question about the different ways of specifying interactions between continuous variables and factors in GAMs, perhaps you can help shed some light on it. If so, thanks in advance. https://stats.stackexchange.com/questions/403772/different-ways-of-modelling-interactions-between-continuous-and-categorical-pred – Marco Plebani Apr 18 '19 at 12:48
23

The "a" in "gam" stands for "additive" which means no interactions, so if you fit interactions you are really not fitting a gam model any more.

That said, there are ways to get some interaction like terms within the additive terms in a gam, you are already using one of those by using the by argument to s. You could try extending this to having the argument by be a matrix with a function (sin, cos) of doy or tod. You could also just fit smoothing splines in a regular linear model that allows interactions (this does not give the backfitting that gam does, but could still be useful).

You might also look at projection pursuit regression as another fitting tool. Loess or more parametric models (with sin and/or cos) might also be useful.

Part of decision on what tool(s) to use is what question you are trying to answer. Are you just trying to find a model to predict future dates and times? are you trying to test to see if particular predictors are significant in the model? are you trying to understand the shape of the relationship between a predictor and the outcome? Something else?

Greg Snow
  • 46,563
  • 2
  • 90
  • 159
  • 3
    Suppose you have two predictors $x_1,x_2$ - isn't $$y = f_1(x_1) + f_2(x_2) + f_3(x_1x_2) + \varepsilon$$ still an gam? This could be thought of as an 'interaction' in some sense. Also, I think the `gam` package allows you to fit models like $$y = f_1(x_1) + f_2(x_2) + f_3(x_1)x_2 + f_4(x_2)x_1 + \varepsilon$$ (which I think is what the `by` argument is doing). Is that still a gam? – Macro Jul 21 '12 at 17:43
  • 1
    @Macro, It depends on if you want to split hairs or not (well technically what you wrote would maybe be am's since you are not really using the g part). – Greg Snow Jul 21 '12 at 18:02
  • 2
    @Macro, I'm sure more has been said on the subject, but see page 4 on GAMs of this Venable's article, [Exegeses on Linear Models](http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf). It isn't clear exactly how non-additive main effects and interaction effects are simultaneously identified. – Andy W Jul 22 '12 at 02:20
  • many thanks for your comments. My main aim here is to not predict future values but to simply see how each time series alters from the mean of both series. For example, the outcome of mod1 shows how the time series at each location varies from the mean firstly at the time scale of day of year (Doy) and then time of day (Tod). From this I would like to see how each series varies at a function of Doy and Tod which I expect the time series to differ greatly during the summer period. – KatyB Jul 24 '12 at 09:40
  • 5
    @AndyW It is important to note that GAM's have come a long way since Venable's was commenting on them - developments in the last decade or so with the penalised splines sensu Simon Wood (as implemented in **mgcv**) and the fully bayesian treatments address issues such as smoothness selection, interactions & how to fit them (tensor products of marginal bases being one approach) in the additive model framework. I'm reasonably confident Venable's & Cox's objections to GAMs as outlined in the former's *Exegeses* have to large degree been addressed by these recent developments in GAM theory. – Gavin Simpson Apr 11 '17 at 23:24