Correlation between numerical and categorical data in R

Question

I have a dataset with over 20 variables. Some of them are numerical and some of them are categorical:

C          <- c(4, 8.5, 2, 5, 6)
N          <- c(0.4, 0.1, 0.5, 1.2, 1.1)
moisture   <- factor(c("dry", "dry", "dry", "wet", "wet"))
vegetation <- factor(c("forest", "wetland", "field", "forest", "wetland"))
df         <- data.frame(C, N, moisture, vegetation)

I want to know the pairwise correlation between each pair of these variables. I found two solutions for this: `rcorr()` and `hetcor()`. While `rcorr()` gives me Pearson's product-moment correlation or Spearman's rho rank correlation, including p-values, `hetcor()` distinguishes between polyserial and polychoric correlations but gives no p-values.

I would use `rcorr()` with Pearson, which has the advantage of also including p-values, but I am not sure whether it is appropriate for this sort of data. Can I still speak of correlations in this case, or do I need to talk about significance of association? If I use `hetcor()`, I gain the advantage of it being applicable to categorical data, but I don't get the p-values.

This is probably a duplicate of [Correlations with categorical variables](http://stats.stackexchange.com/q/108007/7290). I think you will find the information you need there. Please read it. If it isn't what you want / you still have a question afterwards, come back here & edit your question to state what you learned & what you still need to know. Then we can provide the information you need without just duplicating material elsewhere that already didn't help you. — gung - Reinstate Monica, Dec 21 '15 at 21:15
I agree fully with @gung, you might also want to look at [Correlation between a nominal (IV) and a continuous (DV) variable](http://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable/124618#124618) — Silverfish, Dec 21 '15 at 22:10
Ok, thanks for your replies. What I take from this is that neither `hetcor()` nor `rcorr()` makes sense in my case and that I should rather work with a series of linear models or use η from ANOVA, following your [link](http://stats.stackexchange.com/a/124618/34707). Therefore this [post](http://stats.stackexchange.com/a/179458/34707) seems to be misleading and wrong, or is it still possible to use Pearson even though there is no hierarchy in my categorical data? — mace, Dec 21 '15 at 23:03
@mace please see my answer: correlation with an unordered categorical variable makes no sense. It is not really clear what the author of the post you refer to means, or how that answer relates to correlation with categorical data. The code provided in that post would not return any *meaningful* output for unordered categorical data. — Tim, Dec 22 '15 at 19:58

score 5 · Accepted Answer · answered Dec 21 '15 at 21:11

From the `hetcor` documentation you can learn that it:

Computes a heterogenous correlation matrix, consisting of Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables.

It computes correlations in cases where one or both of the variables are ordinal, i.e. categorical with categories that can be ordered in a meaningful way. The categories "forest", "wetland" and "field" cannot be ordered (at least I cannot imagine any meaningful way to do so). Correlation measures a linear relation (or lack of it): one variable increases as the other increases (positive correlation), or one increases as the other decreases (negative correlation). There is no increase or decrease between "forest" and "wetland" etc., so you cannot measure such a linear relation for an unordered categorical variable. See also here for a discussion of a similar case where the order of the categories makes a difference.
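
To illustrate why the ordering matters, here is a minimal sketch using `C` and `vegetation` from the question. The two integer codings (`coding_a`, `coding_b`) are hypothetical and equally arbitrary; they are not something the question or the `hetcor` documentation prescribes. Pearson's coefficient changes depending on which coding is picked, so the number says more about the coding than about the data.

```r
# Data from the question
C          <- c(4, 8.5, 2, 5, 6)
vegetation <- factor(c("forest", "wetland", "field", "forest", "wetland"))

# Two arbitrary (hypothetical) integer codings of the same nominal variable
coding_a <- c(field = 1, forest = 2, wetland = 3)
coding_b <- c(forest = 1, field = 2, wetland = 3)

# Look up the code for each observation by its category label
veg_a <- unname(coding_a[as.character(vegetation)])
veg_b <- unname(coding_b[as.character(vegetation)])

# Pearson correlation under each coding
r_a <- cor(C, veg_a)
r_b <- cor(C, veg_b)

# The two coefficients differ although the underlying data are identical
print(c(coding_a = r_a, coding_b = r_b))
```

With an ordinal variable there is one natural coding (up to reversal), which is why polyserial and polychoric correlations are defined only for that case.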

See also "Should types of data (nominal/ordinal/interval/ratio) really be considered types of variables?"
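
The comments on the question mention η from ANOVA as one alternative for a numeric variable paired with an unordered factor. A rough sketch of that idea in base R, again with the question's `C` and `vegetation`, could look like the following. The names `fit`, `tab`, `eta_sq` and `p_value` are only illustrative, and with five observations the result is not something to interpret seriously.

```r
# Data from the question
C          <- c(4, 8.5, 2, 5, 6)
vegetation <- factor(c("forest", "wetland", "field", "forest", "wetland"))

# One-way ANOVA of the numeric variable on the unordered factor
fit <- aov(C ~ vegetation)
tab <- summary(fit)[[1]]          # ANOVA table as a data frame

# Eta squared: between-group sum of squares over total sum of squares
ss      <- tab[["Sum Sq"]]
eta_sq  <- ss[1] / sum(ss)

# p-value of the F test for the factor
p_value <- tab[["Pr(>F)"]][1]

# Reordering the factor levels leaves eta squared unchanged
veg_rev    <- factor(vegetation, levels = rev(levels(vegetation)))
ss_rev     <- summary(aov(C ~ veg_rev))[[1]][["Sum Sq"]]
eta_sq_rev <- ss_rev[1] / sum(ss_rev)

print(c(eta_sq = eta_sq, eta_sq_reordered = eta_sq_rev, p_value = p_value))
```

Unlike a Pearson coefficient on integer codes, this measure of association does not depend on how the categories are labelled or ordered, and the F test supplies the p-value the question asks about.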
