188

I have R data frame like this:

        age group
1   23.0883     1
2   25.8344     1
3   29.4648     1
4   32.7858     2
5   33.6372     1
6   34.9350     1
7   35.2115     2
8   35.2115     2
9   35.2115     2
10  36.7803     1
...

I need to get data frame in the following form:

group mean     sd
1     34.5     5.6
2     32.3     4.2
...

Group number may vary, but their names and quantity could be obtained by calling levels(factor(data$group))

What manipulations should be done with the data to get the result?

Yuriy Petrovskiy
  • 4,081
  • 7
  • 25
  • 30

9 Answers9

144

Here is the plyr one line variant using ddply:

dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))

Here is another one line variant using new package data.table.

dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)
dt[,list(mean=mean(age),sd=sd(age)),by=group]

This one is faster, though this is noticeable only on table with 100k rows. Timings on my Macbook Pro with 2.53 Ghz Core 2 Duo processor and R 2.11.1:

> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
utilisateur     système      écoulé 
      0.513       0.180       0.692 
> system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé 
      0.087       0.018       0.103 

Further savings are possible if we use setkey:

> setkey(dt,group)
> system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
utilisateur     système      écoulé 
      0.040       0.007       0.048 
mpiktas
  • 33,140
  • 5
  • 82
  • 138
  • 2
    @chl, it gave me a chance to try out this new **data.table** package. It looks really promising. – mpiktas Mar 15 '11 at 12:54
  • 7
    +6000 for data.table. It really is so much faster than ddply, even for me on datasets smaller than 100k (I have one with just 20k rows). Must be something to do with the functions I am applying, but ddply will take minutes and data.table a few seconds. – atomicules Sep 22 '11 at 15:22
  • Simple typo: I think you meant `dt – Christopher Bottoms Oct 24 '14 at 18:50
  • 1
    In my (not humble in this case) opinion `data.table` is the best way to aggregate data and this answer is great, but still only scratches the surface. Aside from being syntactically superior, it's also extremely flexible and has many advanced features that involve joins and internal mechanics. Check out the FAQ, github page, or course for more info. – geneorama Oct 29 '14 at 03:47
103

One possibility is to use the aggregate function. For instance,

aggregate(data$age, by=list(data$group), FUN=mean)[2]

gives you the second column of the desired result.

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
ocram
  • 19,898
  • 5
  • 76
  • 77
  • 1
    Don't link to your local help server :-) +1 but see my comments to @steffen's response. – chl Mar 13 '11 at 12:26
  • Done the thing by calling `data.frame(group=levels(factor(data$group)),mean=(aggregate(data$age, by=list(data$group), FUN=mean)$x),sd=(aggregate(data$age, by=list(data$group), FUN=sd)$x))` but I am not shure it is correct way. I am not sure what will happen then the results of binded columns will be in different order (I think it is possible). What is your oppinion? – Yuriy Petrovskiy Mar 13 '11 at 12:46
  • 9
    @Yuriy The rows should not be out of order, but here is a way to do it one call to `aggregate()`: `aggregate(age ~ group, data=dat, FUN = function(x) c(M=mean(x), SD=sd(x)))` – lockedoff Mar 14 '11 at 16:51
  • @lockedoff: Thank you for having completed my answer! – ocram Mar 15 '11 at 10:22
26

Since you are manipulating a data frame, the dplyr package is probably the faster way to do it.

library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
grp <- group_by(dt, group)
summarise(grp, mean=mean(age), sd=sd(age))

or equivalently, using the dplyr/magrittr pipe operator:

library(dplyr)
dt <- data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T))
group_by(dt, group) %>%
 summarise(mean=mean(age), sd=sd(age))

EDIT full use of pipe operator:

library(dplyr)
data.frame(age=rchisq(20,10), group=sample(1:2,20, rep=T)) %>%
  group_by(group) %>%
  summarise(mean=mean(age), sd=sd(age))
Bastiaan Quast
  • 450
  • 5
  • 12
11

Great, thanks bquast for adding the dplyr solution!

Turns out that then, dplyr and data.table are very close:

library(plyr)
library(dplyr)
library(data.table)
library(rbenchmark)

dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
dt <- data.table(dtf)

setkey(dt,group)

a<-benchmark(ddply(dtf,~group,plyr:::summarise,mean=mean(age),sd=sd(age)),
         dt[,list(mean=mean(age),sd=sd(age)),by=group],
         group_by(dt, group) %>% summarise(mean=mean(age),sd=sd(age) ),
         group_by(dtf, group) %>% summarise(mean=mean(age),sd=sd(age) )
)

a[, c(1,3,4)]

data.table is still the fastest, by followed very closely by dplyr(), which interestingly seems faster on the data.frame than the data.table:

                                                              test elapsed relative
1 ddply(dtf, ~group, plyr:::summarise, mean = mean(age), sd = sd(age))   1.689    4.867
2               dt[, list(mean = mean(age), sd = sd(age)), by = group]   0.347    1.000
4   group_by(dtf, group) %>% summarise(mean = mean(age), sd = sd(age))   0.369    1.063
3    group_by(dt, group) %>% summarise(mean = mean(age), sd = sd(age))   0.580    1.671
Matifou
  • 2,699
  • 14
  • 25
  • At first I thought you needed to move setkey into the benchmark, but turns out that takes almost no time at all. – kasterma Oct 16 '14 at 14:27
11

In addition to existing suggestions, you might want to check out the describe.by function in the psych package.

It provides a number of descriptive statistics including the mean and standard deviation based on a grouping variable.

Jeromy Anglim
  • 42,044
  • 23
  • 146
  • 250
10

I have found the function summaryBy in the doBy package to be the most convenient for this:

library(doBy)

age    = c(23.0883, 25.8344, 29.4648, 32.7858, 33.6372,
           34.935,  35.2115, 35.2115,  5.2115, 36.7803)
group  = c(1, 1, 1, 2, 1, 1, 2, 2, 2, 1)
dframe = data.frame(age=age, group=group)

summaryBy(age~group, data=dframe, FUN=c(mean, sd))
# 
#   group age.mean    age.sd
# 1     1 30.62333  5.415439
# 2     2 27.10507 14.640441
gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
8

Use the sqldf package. This allows you now to use SQL to summarize the data. Once you load it you can write something like -

sqldf('  select group,avg(age) from data group by group  ')
KalEl
  • 559
  • 3
  • 11
8

Edited: According to chl's suggestions

The function you are looking for is called "tapply" which applies a function per group specified by a factor.

# create some artificial data
set.seed(42)
groups <- 5

agedat <- c()
groupdat <- c()

for(group in 1:groups){
    agedat <- c(agedat,rnorm(100,mean=0 + group,1/group))
    groupdat <- c(groupdat,rep(group,100))
}
dat <- data.frame("age"=agedat,"group"=factor(groupdat))

# calculate mean and stdev age per group
res <- rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x)))))
names(res) <- paste("group",1:5)
row.names(res)[2:3] <- c("mean","sd")

I really suggest to work through a basic R tutorial explaining all commonly used datastructures and methods. Otherwise you will get stuck every inch during programming. See this question for a collection of free available resources.

mlwida
  • 9,922
  • 2
  • 45
  • 74
  • 2
    @steffen +1 but there's no need for a `for` loop here, you can contruct your dataframe inline, IMO. For the `tapply` call, use `function(x) c(mean(x),sd(x)))` and `cbind` the result as the OP asked for both statistics. Also, `ddply` from the [plyr](http://had.co.nz/plyr/) package could do this smoothly. – chl Mar 13 '11 at 12:24
  • @steffen The problem is I need the exactly the table structure I described. There is no problem with getting means and sd. The problem is with stucture. – Yuriy Petrovskiy Mar 13 '11 at 12:35
  • @chl: Thank you for your comment, did not know about plyr :). I added cbind, but left the rest untouched. May another one take the credit, this answer shall remain as a less optimal example. – mlwida Mar 13 '11 at 12:35
  • @Yuriy: Added cbind. If you already knew how to apply functions per group, you may reformulate your question (just for clarity ;)). – mlwida Mar 13 '11 at 12:37
  • @steffen `cbind("mean"=mperage,"stdev"=stperage) gives no 'group' column. Will be joining by `cbind(group=levels(factor(data$group)),"mean"=mperage,"stdev"=stperage)` correct? – Yuriy Petrovskiy Mar 13 '11 at 12:51
  • @Yuriy: The row names are the group-names .... Nevertheless I added a separate column (now using data.frame) to show this. – mlwida Mar 13 '11 at 12:51
  • @steffen Are you sure there will be no ordering issues when joining columns? – Yuriy Petrovskiy Mar 13 '11 at 12:53
  • @Yuriy, tapply orders the results by the order of the levels, so there ordering issues will not arise. @steffen, I upvoted the answer for the effort, before I knew plyr, I used to solve such problems in similar way. – mpiktas Mar 13 '11 at 13:24
  • @steffen Try to replace your last three lines of code with: `t(rbind.data.frame(group=1:5, with(dat, tapply(age, group, function(x) c(mean(x), sd(x))))))` (+ rename the last two columns as `mean` and `sd`). – chl Mar 13 '11 at 15:36
7

Here is an example with the function aggregates() I did myself some time ago:

# simulates data
set.seed(666)
( dat <- data.frame(group=gl(3,6), level=factor(rep(c("A","B","C"), 6)), 
                    y=round(rnorm(18,10),1)) )

> dat
   group level    y
1      1     A 10.8
2      1     B 12.0
3      1     C  9.6
4      1     A 12.0
5      1     B  7.8
6      1     C 10.8
7      2     A  8.7
8      2     B  9.2
9      2     C  8.2
10     2     A 10.0
11     2     B 12.2
12     2     C  8.2
13     3     A 10.9
14     3     B  8.3
15     3     C 10.1
16     3     A  9.9
17     3     B 10.9
18     3     C 10.3

# aggregates() function
aggregates <- function(formula, data=NULL, FUNS){ 
    if(class(FUNS)=="list"){ 
        f <- function(x) sapply(FUNS, function(fun) fun(x)) 
    }else{f <- FUNS} 
    temp <- aggregate(formula, data, f) 
    out <- data.frame(temp[,-ncol(temp)], temp[,ncol(temp)]) 
    colnames(out)[1] <- colnames(temp)[1] 
return(out) 
} 

# example 
FUNS <- function(x) c(mean=round(mean(x),0), sd=round(sd(x), 0)) 
( ag <- aggregates(y~group:level, data=dat, FUNS=FUNS) ) 

It gives the following result:

> ag
  group level mean sd
1     1     A   11  1
2     2     A    9  1
3     3     A   10  1
4     1     B   10  3
5     2     B   11  2
6     3     B   10  2
7     1     C   10  1
8     2     C    8  0
9     3     C   10  0

Maybe you can get the same result starting from the R function split():

> with(dat, sapply( split(y, group:level), FUNS ) )
     1:A 1:B 1:C 2:A 2:B 2:C 3:A 3:B 3:C
mean  11  10  10   9  11   8  10  10  10
sd     1   3   1   1   2   0   1   2   0

Let me come back to the output of the aggregates function. You can transform it in a beautiful table using reshape(), xtabs() and ftable():

rag <- reshape(ag, varying=list(3:4), direction="long", v.names="y") 
rag$time <- factor(rag$time) 
ft <- ftable(xtabs(y~group+level+time, data=rag)) 
attributes(ft)$col.vars <- list(c("mean","sd")) 

This gives:

> ft 
             mean sd
group level         
1     A        11  1
      B        10  3
      C        10  1
2     A         9  1
      B        11  2
      C         8  0
3     A        10  1
      B        10  2
      C        10  0

Beautiful, isn't it? You can export this table to a pdf with the textplot() function of the gplots package.

See here for others' solutions.

gung - Reinstate Monica
  • 132,789
  • 81
  • 357
  • 650
Stéphane Laurent
  • 17,425
  • 5
  • 59
  • 101