R: compute correlation by group

Question

In R, I have a data frame comprising a class label C (a factor) and two measurements, M1 and M2. How do I compute the correlation between M1 and M2 within each class?

Ideally, I'd get back a data frame with one row for each class and two columns: the class label C and the correlation.

score 20 · Accepted Answer · answered Oct 28 '10 at 08:40

20

The package plyr is the way to go.

Here is a simple solution:

xx <- data.frame(group = rep(1:4, 100), a = rnorm(400) , b = rnorm(400) )
head(xx)

require(plyr)
func <- function(xx)
{
return(data.frame(COR = cor(xx$a, xx$b)))
}

ddply(xx, .(group), func)

The output will be:

  group         COR
1     1  0.05152923
2     2 -0.15066838
3     3 -0.04717481
4     4  0.07899114

answered Oct 28 '10 at 08:40

Tal Galili

19,935
32
133
195

1

(+1) Nice `plyr` package, isn't it? :) – chl Oct 28 '10 at 08:47
This works great. Thanks for pointing out the plyr package! Could you please explain the ".(group)" syntax? – NPE Oct 28 '10 at 14:31
2

aix - sure. It means "split the data by the variable between .(), and on each subset perform the function". In order to have it include more variables, you should simply use this syntax: .(var1, var2, var3) . Which is like cutting your data by each combination of levels of var1, var2 and var3. And on each cut to perform your function. This package is maintained by Hadley (also th author of ggplot2), so I trust it will keep developing. – Tal Galili Oct 29 '10 at 06:54
2

Oh, and BTW, you could also use plyr with a parallel computing on several cores (almost automatically), see: http://www.r-statistics.com/2010/09/using-the-plyr-1-2-package-parallel-processing-backend-with-windows/ – Tal Galili Oct 29 '10 at 06:55
@chl - indeed :) – Tal Galili Oct 29 '10 at 06:57
1

That's a nice answer, but I'm astonished there isn't a built-in solution for this, something like cor(x, y, by=z) would be so intuitive... – Waldir Leoncio Aug 01 '13 at 18:20

score 12 · Answer 2 · answered Oct 28 '10 at 10:46

If you are inclined to use functions in the base package, you can use the by function, then reassemble the data:

xx <- data.frame(group = rep(1:4, 100), a = rnorm(400) , b = rnorm(400) )
head(xx)

# This returns a "by" object
result <- by(xx[,2:3], xx$group, function(x) {cor(x$a, x$b)})

# You get pretty close to what you want if you coerce it into a data frame via a matrix
result.dataframe <- as.data.frame(as.matrix(result))

# Add the group column from the row names
result.dataframe$C <- rownames(result)

Nice, thanks! I've been experimenting with `by`, but couldn't figure out how to transform the result into a data frame. — NPE, Oct 28 '10 at 14:38

score 9 · Answer 3 · answered Oct 29 '10 at 15:23

9

Another example using base packages and Tal's example data:

DataCov <- do.call( rbind, lapply( split(xx, xx$group),
             function(x) data.frame(group=x$group[1], mCov=cov(x$a, x$b)) ) )

answered Oct 29 '10 at 15:23

Joshua Ulrich

1,376
10
16

Elegant solution Joshue. Do you think there are cases when one solution is better then another? – Tal Galili Oct 29 '10 at 16:05
2

I think it's a matter of preference. My example is essentially what `plyr` does but it gives you finer control, though it's not nearly as clean. My opinion would change if one solution had a better time/memory profile. I haven't compared them though. – Joshua Ulrich Oct 29 '10 at 16:08
How does this return the correlation? – Aug 14 '15 at 10:44

score 2 · Answer 4 · edited Aug 14 '15 at 13:27

2

Using data.table is shorter than dplyr

dt <- data.table(xx)
dtCor <- dt[, .(mCor = cor(M1,M2)), by=C]

edited Aug 14 '15 at 13:27

Jeff

3,525
5
27
38

answered Aug 08 '15 at 04:13

jp4711

31
1

score 0 · Answer 5 · answered Oct 08 '15 at 21:14

Here's a more modern solution, using the dplyr package (which didn't yet exist when the question was asked):

Construct the input:

xx <- data.frame(group = rep(1:4, 100), a = rnorm(400) , b = rnorm(400) )

Compute the correlations:

library(dplyr)
xx %>%
  group_by(group) %>%
  summarize(COR=cor(a,b))

The output:

Source: local data frame [4 x 2]

  group         COR
  (int)       (dbl)
1     1  0.05112400
2     2  0.14203033
3     3 -0.02334135
4     4  0.10626273

score 0 · Answer 6 · edited Aug 14 '15 at 11:26

Here's a similar method that will give you a table with the n's and p values for each correlation as well (rounded to 3 decimal places for convenience):

library(Hmisc)
corrByGroup <- function(xx){
  return(data.frame(cbind(correl = round(rcorr(xx$a, xx$b)$r[1,2], digits=3),
                          n = rcorr(xx$a, xx$b)$n[1,2],
                          pvalue = round(rcorr(xx$a, xx$b)$P[1,2], digits=3))))
}

R: compute correlation by group

6 Answers6

Linked

Related