14

Sometimes I need to get only the first row of a data set grouped by an identifier, as when retrieving age and gender when there are multiple observations per individual. What's a fast (or the fastest) way to do this in R? I used aggregate() below and suspect there are better ways. Before posting this question I searched a bit on Google, found and tried ddply, and was surprised that it was extremely slow and gave me memory errors on my dataset (400,000 rows x 16 cols, 7,000 unique IDs), whereas the aggregate() version was reasonably fast.

(dx <- data.frame(ID = factor(c(1,1,2,2,3,3)), AGE = c(30,30,40,40,35,35), FEM = factor(c(1,1,0,0,1,1))))
#   ID AGE FEM
# 1  1  30   1
# 2  1  30   1
# 3  2  40   0
# 4  2  40   0
# 5  3  35   1
# 6  3  35   1
ag <- data.frame(ID=levels(dx$ID))
ag <- merge(ag, aggregate(AGE ~ ID, data=dx, function(x) x[1]), "ID")
ag <- merge(ag, aggregate(FEM ~ ID, data=dx, function(x) x[1]), "ID")
ag
#   ID AGE FEM
# 1  1  30   1
# 2  2  40   0
# 3  3  35   1
# same result:
library(plyr)
ddply(.data = dx, .var = c("ID"), .fun = function(x) x[1,])

UPDATE: See Chase's answer and Matt Parker's comment for what I consider to be the most elegant approach. See @Matthew Dowle's answer for the fastest solution which uses the data.table package.

lockedoff
  • Thanks for all your answers. The data.table solution of @Steve was the fastest by a factor of ~5 on my data set over the aggregate() solution of @Gavin (which in turn was faster than my aggregate() code), and a factor of ~7.5 over the by() solution of @Matt. I didn't time the reshape idea because I couldn't get it working quickly. I'm guessing the solution that @Chase gave will be the fastest and it was actually what I was looking for, but when I started writing this comment, the code wasn't working (I see it's fixed now!). – lockedoff Mar 04 '11 at 18:17
  • Actually @Chase was faster by a factor of ~9 over data.table, so I changed my accepted answer. Thanks again everyone - learned a bunch of new tools. – lockedoff Mar 04 '11 at 18:29
  • sorry, I fixed my code. The one caveat or trick here is to concatenate a value that isn't one of your IDs in the `diff()` so that you can pick up the first ID in `dx`. – Chase Mar 04 '11 at 18:30

6 Answers

17

Following up on Steve's reply, there is a much faster way in data.table:

> # Preamble
> dx <- data.frame(
+     ID = sort(sample(1:7000, 400000, TRUE))
+     , AGE = sample(18:65, 400000, TRUE)
+     , FEM = sample(0:1, 400000, TRUE)
+ )
> dxt <- data.table(dx, key='ID')

> # fast self join
> system.time(ans2<-dxt[J(unique(ID)),mult="first"])
 user  system elapsed 
0.048   0.016   0.064

> # slower using .SD
> system.time(ans1<-dxt[, .SD[1], by=ID])
  user  system elapsed 
14.209   0.012  14.281 

> mapply(identical,ans1,ans2)  # ans1 is keyed but ans2 isn't, otherwise identical
  ID  AGE  FEM 
TRUE TRUE TRUE 

If you merely need the first row of each group, it's much faster to join to that row directly. Why create the .SD object each time, only to use the first row of it?
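
(A hedged aside, not part of the original answer: `mult` also accepts "last", so the same keyed join can grab the last row of each group instead.)

> # same idea for the last row of each group (assumes dxt keyed as above)
> ans4 <- dxt[J(unique(ID)), mult="last"]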

Compare the 0.064 seconds of data.table to "Matt Parker's alternative to Chase's solution" (which seemed to be the fastest so far):

> system.time(ans3<-dxt[c(TRUE, dxt$ID[-1] != dxt$ID[-length(dxt$ID)]), ])
 user  system elapsed 
0.284   0.028   0.310 
> identical(ans1,ans3)
[1] TRUE 

So ~5 times faster, but it's a tiny table at under 1 million rows. As size increases, so does the difference.
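
To watch the gap grow, here's a sketch at larger scale (the 10 million rows and 70,000 IDs are my choice; timings omitted since they depend on your machine):

> dx <- data.frame(
+     ID = sort(sample(1:70000, 1e7, TRUE))
+     , AGE = sample(18:65, 1e7, TRUE)
+     , FEM = sample(0:1, 1e7, TRUE)
+ )
> dxt <- data.table(dx, key='ID')
> system.time(ans2 <- dxt[J(unique(ID)), mult="first"])
> system.time(ans3 <- dxt[c(TRUE, dxt$ID[-1] != dxt$ID[-length(dxt$ID)]), ])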

Matt Dowle
  • Wow, I never really appreciated how "smart" the `[.data.table` function can get ... I guess I didn't realize you didn't create an `.SD` object if you didn't really need it. Nice one! – Steve Lianoglou Mar 08 '11 at 04:18
  • Yes, that is indeed fast! Even if you include `dxt – lockedoff Mar 08 '11 at 16:32
  • I guess this is outdated now, as with the newer data.table versions `.SD[1L]` was fully optimized and actually @SteveLianoglou's answer would be twice as fast for 5e7 rows. – David Arenburg May 04 '18 at 07:11
  • @DavidArenburg As from v1.9.8 Nov 2016, yes. Feel free to edit this answer directly, or maybe this Q needs to be community wiki or something. – Matt Dowle May 04 '18 at 18:15
11

Is your ID column really a factor? If it is in fact numeric, I think you can use the diff function to your advantage. You could also coerce it to numeric with as.numeric().

dx <- data.frame(
    ID = sort(sample(1:7000, 400000, TRUE))
    , AGE = sample(18:65, 400000, TRUE)
    , FEM = sample(0:1, 400000, TRUE)
)

dx[ diff(c(0,dx$ID)) != 0, ]  # keep rows where ID changes from the row before
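
If ID really is a factor (as in the question's toy data), a small sketch of the coercion idea mentioned above -- the factor's underlying integer codes change exactly where the ID does, so they feed the diff() trick just as well (assuming the data are sorted by ID, as here):

dxf <- dx
dxf$ID <- factor(dxf$ID)                    # pretend ID came in as a factor
dxf[ diff(c(0, as.numeric(dxf$ID))) != 0, ] # same rows as the numeric version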
Chase
  • 1
    Clever! You could also do `dx[c(TRUE, dx$ID[-1] != dx$ID[-length(dx$ID)]), ]` for non-numeric data - I get 0.03 for character, 0.05 for factors. PS: there's an extra `)` in your first `system.time()` function, after the second zero. – Matt Parker Mar 04 '11 at 18:32
  • @Matt - good call and nice catch. I don't appear to be able to copy/paste code worth a flip today. – Chase Mar 04 '11 at 18:33
  • I am working on the London Cycle Hire scheme, and needed a way to find the first and last instances of users' bike hires. With 1 million users, 10 million trips per year and several years' data, my "for" loop was doing 1 user per second. I tried the "by" solution, and it failed to complete after an hour. At first I couldn't fathom what "Matt Parker's alternative to Chase's solution" was doing, but finally the penny dropped, and it executes in seconds. So the point about the improvement becoming greater with larger datasets is proven by my experience. – George Simpson Apr 28 '15 at 08:01
  • @GeorgeSimpson - glad to see this is still being referenced! The `data.table` solution down below should prove to be the fastest, so I'd check that out if I were you (it should probably be the accepted answer here). – Chase Apr 28 '15 at 12:17
10

You can try to use the data.table package.

For your particular case, the upside is that it's (insanely) fast. The first time I was introduced to it, I was working on data.frame objects with hundreds of thousands of rows. "Normal" aggregate or ddply methods were taking ~1-2 minutes to complete (this was before Hadley introduced the idata.frame mojo into ddply). Using data.table, the operation was literally done in a matter of seconds.

The downside is that it's so fast because it re-sorts your data.table (it's just like a data.frame) by its "key columns" and uses a smart searching strategy to find subsets of your data. This results in a reordering of your data before you collect stats over it.

Given that you just want the first row of each group, the reordering may change which row is first, which is why it might not be appropriate in your situation.
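
For instance, a small sketch of that caveat using the question's dx (the shuffle is mine, just to make the reordering visible):

dx.mixed <- dx[sample(nrow(dx)), ]    # shuffle the original row order
dxt <- data.table(dx.mixed, key='ID') # keying re-sorts the rows by ID
dxt[, .SD[1,], by=ID]                 # "first" now reflects the sorted order,
                                      # not the order your data arrived in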

Anyway, you'll have to judge whether or not data.table is appropriate here, but this is how you would use it with the data you've presented:

install.packages('data.table') ## if you don't have it already
library(data.table)
dxt <- data.table(dx, key='ID')
dxt[, .SD[1,], by=ID]
     ID AGE FEM
[1,]  1  30   1
[2,]  2  40   0
[3,]  3  35   1

Update: Matthew Dowle (the main developer of the data.table package) has provided a better/smarter/(extremely) more efficient way to use data.table to solve this problem as one of the answers here ... definitely check that out.

Steve Lianoglou
10

You don't need multiple merge() steps, just aggregate() both variables of interest:

> aggregate(dx[, -1], by = list(ID = dx$ID), head, 1)
  ID AGE FEM
1  1  30   1
2  2  40   0
3  3  35   1

> system.time(replicate(1000, aggregate(dx[, -1], by = list(ID = dx$ID), 
+                                       head, 1)))
   user  system elapsed 
  2.531   0.007   2.547 
> system.time(replicate(1000, {ag <- data.frame(ID=levels(dx$ID))
+ ag <- merge(ag, aggregate(AGE ~ ID, data=dx, function(x) x[1]), "ID")
+ ag <- merge(ag, aggregate(FEM ~ ID, data=dx, function(x) x[1]), "ID")
+ }))
   user  system elapsed 
  9.264   0.009   9.301

Comparison timings:

1) Matt's solution:

> system.time(replicate(1000, {
+ agg <- by(dx, dx$ID, FUN = function(x) x[1, ])
+ # Which returns a list that you can then convert into a data.frame thusly:
+ do.call(rbind, agg)
+ }))
   user  system elapsed 
  3.759   0.007   3.785

2) Zach's reshape2 solution:

> system.time(replicate(1000, {
+ dx <- melt(dx,id=c('ID','FEM'))
+ dcast(dx,ID+FEM~variable,fun.aggregate=mean)
+ }))
   user  system elapsed 
 12.804   0.032  13.019

3) Steve's data.table solution:

> system.time(replicate(1000, {
+ dxt <- data.table(dx, key='ID')
+ dxt[, .SD[1,], by=ID]
+ }))
   user  system elapsed 
  5.484   0.020   5.608 
> dxt <- data.table(dx, key='ID') ## one time step
> system.time(replicate(1000, {
+ dxt[, .SD[1,], by=ID] ## try this one line on own
+ }))
   user  system elapsed 
  3.743   0.006   3.784

4) Chase's fast solution using numeric, not factor, ID:

> dx2 <- within(dx, ID <- as.numeric(ID))
> system.time(replicate(1000, {
+ dy <- dx2[order(dx2$ID),]
+ dy[ diff(c(0,dy$ID)) != 0, ]
+ }))
   user  system elapsed 
  0.663   0.000   0.663

and 5) Matt Parker's alternative to Chase's solution, for character or factor ID, which is slightly faster than Chase's numeric ID one:

> system.time(replicate(1000, {
+ dx[c(TRUE, dx$ID[-1] != dx$ID[-length(dx$ID)]), ]
+ }))
   user  system elapsed 
  0.513   0.000   0.516
Gavin Simpson
  • Oh, right, thanks! Forgot about that syntax for aggregate. – lockedoff Mar 04 '11 at 17:53
  • If you'd like to add Chase's solution, here's what I got: `dx$ID – lockedoff Mar 04 '11 at 18:34
  • @lockedoff - done, thanks, but I didn't randomly sample the `ID`s so the result were comparable to other solutions. – Gavin Simpson Mar 04 '11 at 20:17
  • And time @Matt Parker's version in the comments to @Chase's answer – Gavin Simpson Mar 04 '11 at 20:21
  • 2
    Thanks for doing the timings, Gavin - that's really helpful for questions like these. – Matt Parker Mar 04 '11 at 20:53
  • Of course, it just now occurs to me that Chase's solution and my variant both rely on pre-sorted data. Gavin, would you mind testing to see how sorting the data affects the timings? It took almost three times as long on my (considerably slower) computer when I used Chase's version of `dx`. – Matt Parker Mar 04 '11 at 21:08
  • @Matt can I just check what you mean? You version works fine with a random ordering of rows in the original `dx`. The only difference is the result is not in the order of the `ID`s. – Gavin Simpson Mar 05 '11 at 09:51
  • Hmm, it shouldn't as it's dependent on change in ID from one record to the next to identify first records. Try: `dx.mix – Matt Parker Mar 06 '11 at 18:19
4

You could try

agg <- by(dx, dx$ID, FUN = function(x) x[1, ])
# Which returns a list that you can then convert into a data.frame thusly:
do.call(rbind, agg)

I have no idea if this will be any faster than plyr, though.

Matt Parker
4

Try reshape2

library(reshape2)
dx <- melt(dx,id=c('ID','FEM'))
dcast(dx,ID+FEM~variable,fun.aggregate=mean)
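
(A hedged note, not part of the original answer: aggregating with mean only reproduces the first row because AGE is constant within each ID here; to literally take the first value per group you could pass an indexing function instead.)

dcast(dx,ID+FEM~variable,fun.aggregate=function(x) x[1])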
Zach