Lagging over a grouped time series

Question

I have a few tens of thousands of observations that are in a time series but grouped by locations. For example:

location date     observationA observationB
---------------------------------------
 A       1-2010   22           12
 A       2-2010   26           15
 A       3-2010   45           16
 A       4-2010   46           27
 B       1-2010   167          48
 B       2-2010   134          56
 B       3-2010   201          53
 B       4-2010   207          42

I want to see if month x's observationA has any linear relationship with month x+1's observationB.

I did some research and found a function in zoo, but it doesn't appear to have a way to restrict the lag to within a group. If I used zoo and lagged observationB by one row, location A's last observationB would end up as location B's first lagged value. I'd rather have the first observationB of each location be NA, or some other obvious value that says "don't touch this row".

I guess what I'm getting at is whether there's a built-in way of doing this in R? If not, I imagine I can get this done with a standard loop construct. Or do I even need to manipulate the data?

mpiktas · Accepted Answer · 2015-02-25T14:19:43.633

There are several ways to get a lagged variable within a group. First, sort the data so that within each group the rows are in time order.

First let us create a sample data.frame:

> set.seed(13)
> dt <- data.frame(location = rep(letters[1:2], each = 4), time = rep(1:4, 2), var = rnorm(8))
> dt
  location time        var
1        a    1  0.5543269
2        a    2 -0.2802719
3        a    3  1.7751634
4        a    4  0.1873201
5        b    1  1.1425261
6        b    2  0.4155261
7        b    3  1.2295066
8        b    4  0.2366797
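
The sample data above are already sorted. If the rows arrived in arbitrary order, the sort step could be sketched as follows; the data frame and the names raw, shuffled and sorted are illustrative assumptions, not part of the original example.

```r
# Hypothetical example data: two locations, four time points each.
set.seed(1)
raw <- data.frame(location = rep(c("a", "b"), each = 4),
                  time     = rep(1:4, 2),
                  var      = rnorm(8))

# Shuffle the rows to mimic data that arrive unsorted.
shuffled <- raw[sample(nrow(raw)), ]

# Sort by group first, then by time within each group.
sorted <- shuffled[order(shuffled$location, shuffled$time), ]
rownames(sorted) <- NULL
sorted
```

After this step a position-based lag within each location shifts values by exactly one time step.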

Define our lag function:

 lg <- function(x) c(NA, head(x, -1))  # head(x, -1) drops the last element and also works for one-row groups
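
One edge case to keep in mind with unbalanced data: a lag helper that indexes with x[1:(length(x) - 1)] returns two values for a group with a single row, because 1:0 counts down to c(1, 0). The sketch below compares that form with a head()-based one; lg_index and lg_head are illustrative names.

```r
# Index-based variant: 1:(length(x) - 1) becomes 1:0 when x has one element.
lg_index <- function(x) c(NA, x[1:(length(x) - 1)])

# head()-based variant: head(x, -1) is empty when x has one element.
lg_head <- function(x) c(NA, head(x, -1))

length(lg_index(5))  # 2, one value too many
length(lg_head(5))   # 1

# Both agree on longer vectors.
identical(lg_index(c(1, 2, 3)), lg_head(c(1, 2, 3)))  # TRUE
```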

Then the lag of variable within group can be calculated using tapply:

 > unlist(tapply(dt$var, dt$location, lg))
        a1         a2         a3         a4         b1         b2         b3         b4 
        NA  0.5543269 -0.2802719  1.7751634         NA  1.1425261  0.4155261  1.2295066 
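
Note that unlist(tapply(...)) returns the values ordered by group, so they line up with the rows of the data frame only when the rows are already sorted by group. A minimal base R sketch that writes the lag back in the original row order uses split() and unsplit(); the data frame d below is a made-up example with interleaved locations.

```r
lg <- function(x) c(NA, head(x, -1))

# Rows interleaved by location, but in time order within each location.
d <- data.frame(location = c("a", "b", "a", "b", "a", "b"),
                time     = c(1, 1, 2, 2, 3, 3),
                var      = c(10, 100, 20, 200, 30, 300))

# Split the values by group, lag each piece, then put the pieces back
# into the positions they came from.
d$lvar <- unsplit(lapply(split(d$var, d$location), lg), d$location)
d
```

Here lvar is NA for the first row of each location and otherwise holds the previous value from the same location.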

Using ddply from package plyr:

> library(plyr)
> ddply(dt, ~location, transform, lvar = lg(var))
  location time        var       lvar
1        a    1  0.5543269         NA
2        a    2 -0.2802719  0.5543269
3        a    3  1.7751634 -0.2802719
4        a    4  0.1873201  1.7751634
5        b    1  1.1425261         NA
6        b    2  0.4155261  1.1425261
7        b    3  1.2295066  0.4155261
8        b    4  0.2366797  1.2295066

Speedier version using data.table from package data.table:

 > library(data.table)
 > ddt <- data.table(dt)
 > ddt[,lvar := lg(var), by = c("location")]
     location time        var       lvar
[1,]        a    1  0.5543269         NA
[2,]        a    2 -0.2802719  0.5543269
[3,]        a    3  1.7751634 -0.2802719
[4,]        a    4  0.1873201  1.7751634
[5,]        b    1  1.1425261         NA
[6,]        b    2  0.4155261  1.1425261
[7,]        b    3  1.2295066  0.4155261
[8,]        b    4  0.2366797  1.2295066
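
data.table also ships its own shift() function, which can replace the hand-written lag helper and produce leads as well. The sketch below assumes a reasonably recent data.table; the column names lvar and fvar and the sample values are illustrative.

```r
library(data.table)

# Hypothetical example data: two locations, four time points each.
ddt <- data.table(location = rep(c("a", "b"), each = 4),
                  time     = rep(1:4, 2),
                  var      = c(1, 2, 3, 4, 10, 20, 30, 40))

# Make sure rows are in time order within each location.
setorder(ddt, location, time)

# shift() lags by default; type = "lead" gives the next value instead.
ddt[, lvar := shift(var, n = 1L, type = "lag"),  by = location]
ddt[, fvar := shift(var, n = 1L, type = "lead"), by = location]
print(ddt)
```

Both columns are added by reference, so ddt itself is modified.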

Using the lag function from package plm:

 > library(plm)
 > pdt <- pdata.frame(dt)
 > lag(pdt$var)
       a-1        a-2        a-3        a-4        b-1        b-2        b-3        b-4 
        NA  0.5543269 -0.2802719  1.7751634         NA  1.1425261  0.4155261  1.2295066 

Using the lag function from package dplyr:

> library(dplyr)
> dt %>% group_by(location) %>% mutate(lvar = lag(var))
Source: local data frame [8 x 4]
Groups: location        
  location time        var       lvar
1        a    1  0.5543269         NA
2        a    2 -0.2802719  0.5543269
3        a    3  1.7751634 -0.2802719
4        a    4  0.1873201  1.7751634
5        b    1  1.1425261         NA
6        b    2  0.4155261  1.1425261
7        b    3  1.2295066  0.4155261
8        b    4  0.2366797  1.2295066

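The data.table and plm approaches require converting the data.frame to another object. With plm you then do not need to worry about sorting, because pdata.frame orders the rows by group and time index; the other approaches lag by row position and rely on the sort described at the start. My personal preference is the dplyr approach, which was not available when this answer was first written.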
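
One caveat worth checking with dplyr: lag() shifts by row position within each group, so it depends on the rows being in time order. Its order_by argument lets the lag follow a time column without sorting the data first. A small sketch with made-up, deliberately unsorted data:

```r
library(dplyr)

# Hypothetical example data, not sorted by time within location.
d <- data.frame(location = c("a", "a", "a", "b", "b", "b"),
                time     = c(3, 1, 2, 2, 3, 1),
                var      = c(30, 10, 20, 200, 300, 100))

# order_by = time makes the lag follow time order inside each group
# while leaving the rows where they are.
res <- d %>%
  group_by(location) %>%
  mutate(lvar = dplyr::lag(var, order_by = time)) %>%
  ungroup()
res
```

Calling dplyr::lag() explicitly avoids picking up stats::lag() when dplyr is not attached.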

Update: Changed the data.table code to reflect the developments of the data.table package, pointed out by @Hibernating.

Update 2: Added dplyr example.

Great explanation! Is there a package/function that can deal with irregularly spaced grouped time series (panels) and unbalanced panels? — Helix123, May 09 '16 at 08:10
All of the code examples would work for unbalanced panels. For irregularly spaced time series the concept of lag is a bit complicated, since it may not exist for all groups. — mpiktas, May 09 '16 at 08:45
You can ask about lags for irregular time series on Stack Overflow. These types of questions are now off-topic on stats.SE. — mpiktas, May 09 '16 at 08:47
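
For the irregular spacing raised in the comments, one possible approach (an assumption, not something proposed in the thread) is to look up the value at time - 1 within the same location, so that a gap in the series yields NA instead of the previous row. A base R sketch with made-up data and integer time periods:

```r
# Hypothetical panel with gaps: location a lacks period 3, location b lacks period 2.
d <- data.frame(location = c("a", "a", "a", "a", "b", "b", "b"),
                time     = c(1L, 2L, 4L, 5L, 1L, 3L, 4L),
                var      = c(10, 20, 40, 50, 100, 300, 400))

# Build a key for each row and a key for the period just before it.
key      <- paste(d$location, d$time)
prev_key <- paste(d$location, d$time - 1L)

# match() returns NA where the previous period is absent for that location.
d$lvar <- d$var[match(prev_key, key)]
d
```

This does not depend on row order, but it assumes that a one-period lag means exactly time - 1.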

score 2 · Answer 2 · edited Jun 27 '14 at 21:37

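Rather than going through tapply and the additional steps, here's a faster way using ave:

# Sample data: two locations, four time points each
dt <- data.frame(location = rep(letters[1:2], each = 4), time = rep(1:4, 2), var = rnorm(8))
# Shift a vector down by one position, padding the front with NA
lg <- function(x) c(NA, head(x, -1))
# ave() applies lg within each location and keeps the original row order
dt$lg <- ave(dt$var, dt$location, FUN = lg)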
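
The same ave() pattern can be carried through to the regression the question asks about. The sketch below uses the eight rows from the question, with the dates recoded as a month index 1 to 4 (an assumption made here for sorting); lead1 and nextB are illustrative names.

```r
# Data from the question, with date replaced by a month index.
obs <- data.frame(
  location     = rep(c("A", "B"), each = 4),
  month        = rep(1:4, 2),
  observationA = c(22, 26, 45, 46, 167, 134, 201, 207),
  observationB = c(12, 15, 16, 27, 48, 56, 53, 42)
)

# Sort by location and month so that position-based shifts follow time.
obs <- obs[order(obs$location, obs$month), ]

# Lead by one: next month's value, NA for the last month of each location.
lead1 <- function(x) c(tail(x, -1), NA)
obs$nextB <- ave(obs$observationB, obs$location, FUN = lead1)

# Regress next month's observationB on this month's observationA.
fit <- lm(nextB ~ observationA, data = obs)
summary(fit)
```

With the default NA handling, lm() drops the last month of each location, so the fit uses six rows.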

answered Jun 27 '14 at 21:18 by Anirban Sengupta · edited Jun 27 '14 at 21:37 by Nick Stauner

score 2 · Answer 3 · answered Sep 01 '14 at 15:29

With dplyr

dt %>% group_by(location) %>% mutate(lvar=lag(var))

answered Sep 01 '14 at 15:29 by Matthew

score 2 · Answer 4 · answered Sep 01 '20 at 21:39

Just to provide a brief update: The new fastest way to do this in R is with the function flag/L in the collapse package. collapse also supports sequences of lags/leads on vectors, matrices and data frames.

library(collapse)
dt <- data.frame(location = rep(letters[1:2], each = 4), time = rep(1:4, 2), var = rnorm(8))
# Fastest way to append data.frame with lagged variable
settransform(dt, lvar = flag(var, 1, location, time))
dt
  location time        var       lvar
1        a    1 -0.5808824         NA
2        a    2 -0.1606213 -0.5808824
3        a    3  0.6499493 -0.1606213
4        a    4 -0.2126608  0.6499493
5        b    1 -0.5082747         NA
6        b    2 -0.7450488 -0.5082747
7        b    3 -1.5895110 -0.7450488
8        b    4  0.2482062 -1.5895110

# Using plm classes - supported by collapse
pdt <- plm::pdata.frame(dt)
flag(pdt$var)
       a-1        a-2        a-3        a-4        b-1        b-2        b-3        b-4 
        NA -0.5808824 -0.1606213  0.6499493         NA -0.5082747 -0.7450488 -1.5895110 

# Using lag operator directly
L(dt, 1, var ~ location, ~ time)
  location time     L1.var
1        a    1         NA
2        a    2 -0.5808824
3        a    3 -0.1606213
4        a    4  0.6499493
5        b    1         NA
6        b    2 -0.5082747
7        b    3 -0.7450488
8        b    4 -1.5895110

# Benchmark
library(data.table); ddt <- data.table(dt)
f2 <- function() ddt[, lvar:=shift(var), by=location]
f3 <- function() settransform(dt, lvar = flag(var, 1, location, time))
microbenchmark::microbenchmark(data.table = f2(), collapse = f3(), times=1000L)

Unit: microseconds
       expr     min      lq     mean  median       uq       max neval cld
 data.table 518.539 568.519 788.7076 638.579 779.5935 23060.711  1000   b
   collapse  44.179  60.913 100.3122  78.094 104.1990  2941.214  1000  a

score 2 · Answer 5 · answered Dec 16 '13 at 10:58

@mpiktas Just to briefly mention two small oversights in version 3 of your answer. First, the phrase "speedier version" has clearly been left in by mistake. Second, the operator ":=" has been left out of the code. Fixing the latter fixes the former :=)

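# Uses dt and lg() as defined in the accepted answer
library(data.table); ddt <- data.table(dt)
f0 <- function() plyr::ddply(dt, ~location, transform, lvar = lg(var))
f1 <- function() ddt[, transform(.SD, lvar = lg(var)), by = c("location")]
f2 <- function() ddt[, lvar := lg(var), by = location]
# all.equal() compares two objects at a time, so check each result against r0
r0 <- f0(); r1 <- f1(); r2 <- f2()
all.equal(r0, as.data.frame(r1), check.attributes = FALSE)
all.equal(r0, as.data.frame(r2), check.attributes = FALSE)
boxplot(microbenchmark::microbenchmark(f0(), f1(), f2(), times = 1000L))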

[Figure: boxplot of the microbenchmark timings for f0(), f1() and f2()]

Wayne · Answer 6 · 2015-02-25T16:49:10.793

You might want to look at the vars package. Sounds like a Vector Autoregression (VAR) is what you may be trying to do.

answered Feb 25 '15 at 16:39 by Wayne · edited Feb 25 '15 at 16:49

score 0 · Answer 7 · answered Apr 26 '16 at 19:58

With DataCombine:

library(DataCombine)
slide(df, Var="observationB", TimeVar="date", GroupVar="location", NewVar="lead.observationB", 
slideBy = 1, keepInvalid = FALSE, reminder = FALSE)

The data need to be sorted as well. With slideBy = 1 the new variable is a lead; use slideBy = -1 for lags.
