17

In R, I have a data frame comprising a class label C (a factor) and two measurements, M1 and M2. How do I compute the correlation between M1 and M2 within each class?

Ideally, I'd get back a data frame with one row for each class and two columns: the class label C and the correlation.

csgillespie
  • 11,849
  • 9
  • 56
  • 85
NPE
  • 5,351
  • 5
  • 33
  • 44

6 Answers6

20

The package plyr is the way to go.

Here is a simple solution:

xx <- data.frame(group = rep(1:4, 100), a = rnorm(400) , b = rnorm(400) )
head(xx)

require(plyr)
func <- function(xx)
{
return(data.frame(COR = cor(xx$a, xx$b)))
}

ddply(xx, .(group), func)

The output will be:

  group         COR
1     1  0.05152923
2     2 -0.15066838
3     3 -0.04717481
4     4  0.07899114
Tal Galili
  • 19,935
  • 32
  • 133
  • 195
  • 1
    (+1) Nice `plyr` package, isn't it? :) – chl Oct 28 '10 at 08:47
  • This works great. Thanks for pointing out the plyr package! Could you please explain the ".(group)" syntax? – NPE Oct 28 '10 at 14:31
  • 2
    aix - sure. It means "split the data by the variable between .(), and on each subset perform the function". In order to have it include more variables, you should simply use this syntax: .(var1, var2, var3) . Which is like cutting your data by each combination of levels of var1, var2 and var3. And on each cut to perform your function. This package is maintained by Hadley (also th author of ggplot2), so I trust it will keep developing. – Tal Galili Oct 29 '10 at 06:54
  • 2
    Oh, and BTW, you could also use plyr with a parallel computing on several cores (almost automatically), see: http://www.r-statistics.com/2010/09/using-the-plyr-1-2-package-parallel-processing-backend-with-windows/ – Tal Galili Oct 29 '10 at 06:55
  • @chl - indeed :) – Tal Galili Oct 29 '10 at 06:57
  • 1
    That's a nice answer, but I'm astonished there isn't a built-in solution for this, something like cor(x, y, by=z) would be so intuitive... – Waldir Leoncio Aug 01 '13 at 18:20
12

If you are inclined to use functions in the base package, you can use the by function, then reassemble the data:

xx <- data.frame(group = rep(1:4, 100), a = rnorm(400) , b = rnorm(400) )
head(xx)

# This returns a "by" object
result <- by(xx[,2:3], xx$group, function(x) {cor(x$a, x$b)})

# You get pretty close to what you want if you coerce it into a data frame via a matrix
result.dataframe <- as.data.frame(as.matrix(result))

# Add the group column from the row names
result.dataframe$C <- rownames(result)
hgcrpd
  • 1,307
  • 2
  • 11
  • 13
  • 1
    Nice, thanks! I've been experimenting with `by`, but couldn't figure out how to transform the result into a data frame. – NPE Oct 28 '10 at 14:38
9

Another example using base packages and Tal's example data:

DataCov <- do.call( rbind, lapply( split(xx, xx$group),
             function(x) data.frame(group=x$group[1], mCov=cov(x$a, x$b)) ) )
Joshua Ulrich
  • 1,376
  • 10
  • 16
  • Elegant solution Joshue. Do you think there are cases when one solution is better then another? – Tal Galili Oct 29 '10 at 16:05
  • 2
    I think it's a matter of preference. My example is essentially what `plyr` does but it gives you finer control, though it's not nearly as clean. My opinion would change if one solution had a better time/memory profile. I haven't compared them though. – Joshua Ulrich Oct 29 '10 at 16:08
  • How does this return the correlation? –  Aug 14 '15 at 10:44
2

Using data.table is shorter than dplyr

dt <- data.table(xx)
dtCor <- dt[, .(mCor = cor(M1,M2)), by=C]
Jeff
  • 3,525
  • 5
  • 27
  • 38
jp4711
  • 31
  • 1
0

Here's a more modern solution, using the dplyr package (which didn't yet exist when the question was asked):

Construct the input:

xx <- data.frame(group = rep(1:4, 100), a = rnorm(400) , b = rnorm(400) )

Compute the correlations:

library(dplyr)
xx %>%
  group_by(group) %>%
  summarize(COR=cor(a,b))

The output:

Source: local data frame [4 x 2]

  group         COR
  (int)       (dbl)
1     1  0.05112400
2     2  0.14203033
3     3 -0.02334135
4     4  0.10626273
Ken Williams
  • 1,670
  • 1
  • 12
  • 14
0

Here's a similar method that will give you a table with the n's and p values for each correlation as well (rounded to 3 decimal places for convenience):

library(Hmisc)
corrByGroup <- function(xx){
  return(data.frame(cbind(correl = round(rcorr(xx$a, xx$b)$r[1,2], digits=3),
                          n = rcorr(xx$a, xx$b)$n[1,2],
                          pvalue = round(rcorr(xx$a, xx$b)$P[1,2], digits=3))))
}
AnnaCM
  • 1