11

I want to calculate a summary of a variable in a data.frame for each unique combination of factors in that data.frame. Should I use plyr to do this? I am OK with using loops rather than apply(), so just finding each unique combination would be enough.

russellpierce
    Question is misleading as you ask about unique combinations of factors and then in details you ask about summary by unique combinations. – Wojtek Aug 17 '10 at 05:59

6 Answers

11

See aggregate and by. For example, from the help file for aggregate:

## Compute the averages according to region and the occurrence of more
## than 130 days of frost.
aggregate(state.x77,
      list(Region = state.region,
           Cold = state.x77[,"Frost"] > 130),
      mean)
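
As a minimal sketch of how the same two functions apply to a plain data frame, the example below uses the built-in mtcars data, with cyl and gear standing in for the factors and mpg for the variable being summarised. Those column choices are assumptions made for illustration, not part of the question.

```r
# Illustrative data: mtcars ships with R; cyl and gear stand in for the factors.
df <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# Formula interface: one row per combination of cyl and gear that occurs
# in the data, holding the mean of mpg for that combination.
means <- aggregate(mpg ~ cyl + gear, data = df, FUN = mean)
print(means)

# by() splits the data frame on the same factors and applies a function
# to each piece; combinations with no rows come back as NULL.
ranges <- by(df, list(cyl = df$cyl, gear = df$gear),
             function(piece) range(piece$mpg))
print(ranges)
```

aggregate() returns an ordinary data frame, which is usually easier to work with afterwards; by() is handier when the per-group result is not a single number.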
Aniko
7

While I think aggregate is probably the solution you are seeking, if you want to create an explicit list of all possible factor combinations, expand.grid will do that for you, e.g.

> expand.grid(height = seq(60, 80, 5), weight = seq(100, 300, 50),
             sex = c("Male","Female"))
   height weight    sex
1      60    100   Male
2      65    100   Male
...
30     80    100 Female
31     60    150 Female

You could then loop over each row in the resulting data frame to pull out records from your original data.
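
A rough sketch of that loop, again using mtcars with cyl and gear as stand-in factors (an assumption for illustration). Note that expand.grid lists every possible combination, including ones that never occur in the data, so the loop has to cope with empty subsets.

```r
df <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))

# All possible combinations of the factor levels, whether or not they occur.
combos <- expand.grid(cyl = levels(df$cyl), gear = levels(df$gear),
                      stringsAsFactors = FALSE)
combos$n <- NA_integer_
combos$mean_mpg <- NA_real_

for (i in seq_len(nrow(combos))) {
  # Records from the original data that match this combination.
  rows <- df[df$cyl == combos$cyl[i] & df$gear == combos$gear[i], ]
  combos$n[i] <- nrow(rows)
  # Leave the summary as NA when the combination has no records.
  if (nrow(rows) > 0) combos$mean_mpg[i] <- mean(rows$mpg)
}
print(combos)

# If only the combinations present in the data are wanted:
observed <- unique(df[c("cyl", "gear")])
print(observed)
```

Using unique() on the factor columns avoids the empty combinations altogether, at the cost of not reporting them.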

3

Here's the plyr solution, which has the advantage of returning multiple summary stats at once and showing a progress bar for long computations:

library(plyr) #for ddply
library(ez) #for a data set
data(ANT)
cell_stats = ddply(
    .data = ANT #use the ANT data
    , .variables = .(cue,flanker) #uses each combination of cue and flanker
    , .fun = function(x){ #apply this function to each combin. of cue & flanker
        to_return = data.frame(
            acc = mean(x$acc) #mean accuracy in this cell
            , mrt = mean(x$rt[x$acc==1]) #mean RT on accurate trials only
        )
        return(to_return)
    }
    , .progress = 'text'
)
Mike Lawrence
  • Thank You! This worked, although I had to drop a comma in the call to data.frame. stats = ddply( .data = ords , .variables = .(Symbol,SysID,Hour) , .fun = function(x){ to_return = data.frame( s = sum(x$Profit) , m = mean(x$Profit) ) return(to_return) } , .progress = 'text' ) –  Aug 16 '10 at 16:06
1

I personally like cast() from the reshape package because of its simplicity:

library(reshape)
cast(melt(tips), sex ~ smoker | variable, c(sd,mean, length))
Brandon Bertelsen
1

In addition to the other suggestions, you may find the describe.by() function in the psych package useful (it is named describeBy() in current versions of psych). It can be used to show summary statistics on numeric variables across levels of a factor variable.

Jeromy Anglim
1

In library(doBy) there is also the summaryBy() function, e.g.

summaryBy(DV1 + DV2 ~ Height+Weight+Sex,data=my.data)
slhck
russellpierce