How to summarize data by group in R?

Question

I have R data frame like this:

        age group
1   23.0883     1
2   25.8344     1
3   29.4648     1
4   32.7858     2
5   33.6372     1
6   34.9350     1
7   35.2115     2
8   35.2115     2
9   35.2115     2
10  36.7803     1
...

I need to get data frame in the following form:

group mean     sd
1     34.5     5.6
2     32.3     4.2
...

Group number may vary, but their names and quantity could be obtained by calling levels(factor(data$group))

What manipulations should be done with the data to get the result?

the commas in the result data frame mean something special, or is it just the decimal point? — mpiktas, Mar 13 '11 at 12:46
@mpiktas Thank you for noting. Corrected. These were locale issues (I am russian) - we we use comma for decimal separation. — Yuriy Petrovskiy, Mar 13 '11 at 12:59
I suspected that. All of the [Europe](http://en.wikipedia.org/wiki/File:DecimalSeparator.svg) uses comma except the British. — mpiktas, Mar 13 '11 at 13:04
Despite not being British, I prefer dot for decimal separator. — Roman Luštrik, Mar 14 '11 at 12:17
See `aggregate`, `tapply`, and then http://stackoverflow.com for any subsequent coding questions of this type. — conjugateprior, Sep 11 '15 at 23:26

mpiktas · Accepted Answer · 2014-10-25T10:16:17.373

144

Here is the plyr one line variant using ddply:

dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))

Here is another one line variant using new package data.table.

dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)
dt[,list(mean=mean(age),sd=sd(age)),by=group]

This one is faster, though this is noticeable only on table with 100k rows. Timings on my Macbook Pro with 2.53 Ghz Core 2 Duo processor and R 2.11.1:

> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
utilisateur     système      écoulé 
      0.513       0.180       0.692 
> system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé 
      0.087       0.018       0.103

Further savings are possible if we use setkey:

> setkey(dt,group)
> system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé 
      0.040       0.007       0.048

edited Oct 25 '14 at 10:16

answered Mar 13 '11 at 12:44

mpiktas

33,140
5
82
138

2

@chl, it gave me a chance to try out this new **data.table** package. It looks really promising. – mpiktas Mar 15 '11 at 12:54
7

+6000 for data.table. It really is so much faster than ddply, even for me on datasets smaller than 100k (I have one with just 20k rows). Must be something to do with the functions I am applying, but ddply will take minutes and data.table a few seconds. – atomicules Sep 22 '11 at 15:22
Simple typo: I think you meant `dt – Christopher Bottoms Oct 24 '14 at 18:50
1

In my (not humble in this case) opinion `data.table` is the best way to aggregate data and this answer is great, but still only scratches the surface. Aside from being syntactically superior, it's also extremely flexible and has many advanced features that involve joins and internal mechanics. Check out the FAQ, github page, or course for more info. – geneorama Oct 29 '14 at 03:47

score 103 · Answer 2 · edited Oct 23 '13 at 16:11

103

One possibility is to use the aggregate function. For instance,

aggregate(data$age, by=list(data$group), FUN=mean)[2]

gives you the second column of the desired result.

edited Oct 23 '13 at 16:11

gung - Reinstate Monica

132,789
81
357
650

answered Mar 13 '11 at 12:12

ocram

19,898
5
76
77

1

Don't link to your local help server :-) +1 but see my comments to @steffen's response. – chl Mar 13 '11 at 12:26
Done the thing by calling `data.frame(group=levels(factor(data$group)),mean=(aggregate(data$age, by=list(data$group), FUN=mean)$x),sd=(aggregate(data$age, by=list(data$group), FUN=sd)$x))` but I am not shure it is correct way. I am not sure what will happen then the results of binded columns will be in different order (I think it is possible). What is your oppinion? – Yuriy Petrovskiy Mar 13 '11 at 12:46
9

@Yuriy The rows should not be out of order, but here is a way to do it one call to `aggregate()`: `aggregate(age ~ group, data=dat, FUN = function(x) c(M=mean(x), SD=sd(x)))` – lockedoff Mar 14 '11 at 16:51
@lockedoff: Thank you for having completed my answer! – ocram Mar 15 '11 at 10:22

Bastiaan Quast · Answer 3 · 2015-11-11T08:03:29.147

26

Since you are manipulating a data frame, the dplyr package is probably the faster way to do it.

library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
grp <- group_by(dt, group)
summarise(grp, mean=mean(age), sd=sd(age))

or equivalently, using the dplyr/magrittr pipe operator:

library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
group_by(dt, group) %>%
 summarise(mean=mean(age), sd=sd(age))

EDIT full use of pipe operator:

library(dplyr)
data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T)) %>%
  group_by(group) %>%
  summarise(mean=mean(age), sd=sd(age))

edited Nov 11 '15 at 08:03

answered Jul 15 '14 at 12:26

Bastiaan Quast

450
5
12

4

+1 for `dplyr`. It has made so many R tasks simple and many of these methods obsolete. – gregmacfarlane Jul 15 '14 at 13:02
The full use of pipe operator version does not work for me unfortunately – dagcilibili Sep 10 '17 at 05:41
did you load dplyr or magrittr ? – Bastiaan Quast Sep 10 '17 at 09:42
thank you very much @bquast for pointing out towards the solution, summarise function was called from `plyr` instead of `dplyr` which was causing the problem. – dagcilibili Sep 10 '17 at 21:31

score 11 · Answer 4 · answered Jul 15 '14 at 15:01

Great, thanks bquast for adding the dplyr solution!

Turns out that then, dplyr and data.table are very close:

library(plyr)
library(dplyr)
library(data.table)
library(rbenchmark)

dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)

setkey(dt,group)

a<-benchmark(ddply(dtf,~group,plyr:::summarise,mean=mean(age),sd=sd(age)),
         dt[,list(mean=mean(age),sd=sd(age)),by=group],
         group_by(dt, group) %>% summarise(mean=mean(age),sd=sd(age) ),
         group_by(dtf, group) %>% summarise(mean=mean(age),sd=sd(age) )
)

a[, c(1,3,4)]

data.table is still the fastest, by followed very closely by dplyr(), which interestingly seems faster on the data.frame than the data.table:

                                                              test elapsed relative
1 ddply(dtf, ~group, plyr:::summarise, mean = mean(age), sd = sd(age))   1.689    4.867
2               dt[, list(mean = mean(age), sd = sd(age)), by = group]   0.347    1.000
4   group_by(dtf, group) %>% summarise(mean = mean(age), sd = sd(age))   0.369    1.063
3    group_by(dt, group) %>% summarise(mean = mean(age), sd = sd(age))   0.580    1.671

At first I thought you needed to move setkey into the benchmark, but turns out that takes almost no time at all. — kasterma, Oct 16 '14 at 14:27

score 11 · Answer 5 · answered Mar 13 '11 at 12:38

11

In addition to existing suggestions, you might want to check out the describe.by function in the psych package.

It provides a number of descriptive statistics including the mean and standard deviation based on a grouping variable.

answered Mar 13 '11 at 12:38

Jeromy Anglim

42,044
23
146
250

its nice, but somewhat tricky to export to LaTeX IME. – richiemorrisroe Mar 14 '11 at 10:08

gung - Reinstate Monica · Answer 6 · 2014-07-09T15:06:36.080

I have found the function summaryBy in the doBy package to be the most convenient for this:

library(doBy)

age    = c(23.0883, 25.8344, 29.4648, 32.7858, 33.6372,
           34.935,  35.2115, 35.2115,  5.2115, 36.7803)
group  = c(1, 1, 1, 2, 1, 1, 2, 2, 2, 1)
dframe = data.frame(age=age, group=group)

summaryBy(age~group, data=dframe, FUN=c(mean, sd))
# 
#   group age.mean    age.sd
# 1     1 30.62333  5.415439
# 2     2 27.10507 14.640441

score 8 · Answer 7 · answered May 23 '13 at 22:25

8

Use the sqldf package. This allows you now to use SQL to summarize the data. Once you load it you can write something like -

sqldf('  select group,avg(age) from data group by group  ')

answered May 23 '13 at 22:25

KalEl

559
3
11

score 8 · Answer 8 · edited Apr 13 '17 at 12:44

8

Edited: According to chl's suggestions

The function you are looking for is called "tapply" which applies a function per group specified by a factor.

# create some artificial data
set.seed(42)
groups <- 5

agedat <- c()
groupdat <- c()

for(group in 1:groups){
    agedat <- c(agedat,rnorm(100,mean=0 + group,1/group))
    groupdat <- c(groupdat,rep(group,100))
}
dat <- data.frame("age"=agedat,"group"=factor(groupdat))

# calculate mean and stdev age per group
res <- rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x)))))
names(res) <- paste("group",1:5)
row.names(res)[2:3] <- c("mean","sd")

I really suggest to work through a basic R tutorial explaining all commonly used datastructures and methods. Otherwise you will get stuck every inch during programming. See this question for a collection of free available resources.

edited Apr 13 '17 at 12:44

Community

1

answered Mar 13 '11 at 12:15

mlwida

9,922
2
45
74

2

@steffen +1 but there's no need for a `for` loop here, you can contruct your dataframe inline, IMO. For the `tapply` call, use `function(x) c(mean(x),sd(x)))` and `cbind` the result as the OP asked for both statistics. Also, `ddply` from the [plyr](http://had.co.nz/plyr/) package could do this smoothly. – chl Mar 13 '11 at 12:24
@steffen The problem is I need the exactly the table structure I described. There is no problem with getting means and sd. The problem is with stucture. – Yuriy Petrovskiy Mar 13 '11 at 12:35
@chl: Thank you for your comment, did not know about plyr :). I added cbind, but left the rest untouched. May another one take the credit, this answer shall remain as a less optimal example. – mlwida Mar 13 '11 at 12:35
@Yuriy: Added cbind. If you already knew how to apply functions per group, you may reformulate your question (just for clarity ;)). – mlwida Mar 13 '11 at 12:37
@steffen `cbind("mean"=mperage,"stdev"=stperage) gives no 'group' column. Will be joining by `cbind(group=levels(factor(data$group)),"mean"=mperage,"stdev"=stperage)` correct? – Yuriy Petrovskiy Mar 13 '11 at 12:51
@Yuriy: The row names are the group-names .... Nevertheless I added a separate column (now using data.frame) to show this. – mlwida Mar 13 '11 at 12:51
@steffen Are you sure there will be no ordering issues when joining columns? – Yuriy Petrovskiy Mar 13 '11 at 12:53
@Yuriy, tapply orders the results by the order of the levels, so there ordering issues will not arise. @steffen, I upvoted the answer for the effort, before I knew plyr, I used to solve such problems in similar way. – mpiktas Mar 13 '11 at 13:24
@steffen Try to replace your last three lines of code with: `t(rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x))))))` (+ rename the last two columns as `mean` and `sd`). – chl Mar 13 '11 at 15:36

score 7 · Answer 9 · edited Oct 23 '13 at 16:09

Here is an example with the function aggregates() I did myself some time ago:

# simulates data
set.seed(666)
( dat <- data.frame(group=gl(3,6), level=factor(rep(c("A","B","C"), 6)), 
                    y=round(rnorm(18,10),1)) )

> dat
   group level    y
1      1     A 10.8
2      1     B 12.0
3      1     C  9.6
4      1     A 12.0
5      1     B  7.8
6      1     C 10.8
7      2     A  8.7
8      2     B  9.2
9      2     C  8.2
10     2     A 10.0
11     2     B 12.2
12     2     C  8.2
13     3     A 10.9
14     3     B  8.3
15     3     C 10.1
16     3     A  9.9
17     3     B 10.9
18     3     C 10.3

# aggregates() function
aggregates <- function(formula, data=NULL, FUNS){ 
    if(class(FUNS)=="list"){ 
        f <- function(x) sapply(FUNS, function(fun) fun(x)) 
    }else{f <- FUNS} 
    temp <- aggregate(formula, data, f) 
    out <- data.frame(temp[,-ncol(temp)], temp[,ncol(temp)]) 
    colnames(out)[1] <- colnames(temp)[1] 
return(out) 
} 

# example 
FUNS <- function(x) c(mean=round(mean(x),0), sd=round(sd(x), 0)) 
( ag <- aggregates(y~group:level, data=dat, FUNS=FUNS) )

It gives the following result:

> ag
  group level mean sd
1     1     A   11  1
2     2     A    9  1
3     3     A   10  1
4     1     B   10  3
5     2     B   11  2
6     3     B   10  2
7     1     C   10  1
8     2     C    8  0
9     3     C   10  0

Maybe you can get the same result starting from the R function split():

> with(dat, sapply( split(y, group:level), FUNS ) )
     1:A 1:B 1:C 2:A 2:B 2:C 3:A 3:B 3:C
mean  11  10  10   9  11   8  10  10  10
sd     1   3   1   1   2   0   1   2   0

Let me come back to the output of the aggregates function. You can transform it in a beautiful table using reshape(), xtabs() and ftable():

rag <- reshape(ag, varying=list(3:4), direction="long", v.names="y") 
rag$time <- factor(rag$time) 
ft <- ftable(xtabs(y~group+level+time, data=rag)) 
attributes(ft)$col.vars <- list(c("mean","sd"))

This gives:

> ft 
             mean sd
group level         
1     A        11  1
      B        10  3
      C        10  1
2     A         9  1
      B        11  2
      C         8  0
3     A        10  1
      B        10  2
      C        10  0

Beautiful, isn't it? You can export this table to a pdf with the textplot() function of the gplots package.

See here for others' solutions.

How to summarize data by group in R?

9 Answers9

Linked