146

I have a dataframe with many observations and many variables. Some of them are categorical (unordered) and the others are numerical.

I'm looking for associations between these variables. I've been able to compute correlations for the numerical variables (Spearman's correlation), but:

  • I don't know how to measure correlation between unordered categorical variables.
  • I don't know how to measure correlation between unordered categorical variables and numerical variables.

Does anyone know how this could be done? If so, are there R functions implementing these methods?

Clément F
  • http://stats.stackexchange.com/q/119835/3277; http://stats.stackexchange.com/q/73065/3277; http://stats.stackexchange.com/q/103253/3277. – ttnphns Sep 09 '16 at 14:17

6 Answers

127

It depends on what sense of a correlation you want. When you run the prototypical Pearson's product-moment correlation, you get a measure of the strength of association and a test of the significance of that association. More typically, however, the significance test and the measure of effect size differ.

Significance tests:

  • continuous vs. continuous: correlation test
  • continuous vs. nominal: ANOVA
  • nominal vs. nominal: chi-squared test

Effect size (strength of association):

  • continuous vs. continuous: Pearson's (or Spearman's) correlation
  • continuous vs. nominal: intraclass correlation
  • nominal vs. nominal: Cramer's V
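A minimal sketch of matching R calls (my pairing, not part of the original answer; x, y, g, g1, g2 are placeholder variables, and assocstats() is from the vcd package):

cor.test(x, y)                         # continuous vs. continuous: test plus Pearson's r
summary(aov(y ~ g))                    # continuous vs. nominal: ANOVA
chisq.test(table(g1, g2))              # nominal vs. nominal: chi-squared test
vcd::assocstats(table(g1, g2))$cramer  # nominal vs. nominal effect size: Cramer's V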

gung - Reinstate Monica
  • A very thorough explanation of the continuous vs. nominal case can be found here: [Correlation between a nominal (IV) and a continuous (DV) variable](http://stats.stackexchange.com/a/124618/). – gung - Reinstate Monica Dec 23 '14 at 16:35
  • In the binary vs interval case there's the [point-biserial correlation](https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient). – Glen_b Mar 12 '15 at 22:54
  • What would be a better alternative to the chi-squared test for large samples? – Waldir Leoncio Jul 19 '15 at 22:33
  • @WaldirLeoncio, "better" in what sense? What is wrong with the chi-squared if you want a test of independence? What constitutes a "large sample" for you? – gung - Reinstate Monica Jul 19 '15 at 23:32
  • Well, from what I've read and experienced, when the sample size is in the tens of thousands, for instance, even small deviations from the expected frequencies — say, something a visual analysis would consider irrelevant — often result in very small ($10^{-16}$) p-values. – Waldir Leoncio Jul 20 '15 at 12:08
  • @WaldirLeoncio, yes, but if the null is true, $p$ will be $<.05$ only 5% of the time. That is the way the test is supposed to work. If you want to know the magnitude of the effect as well, you may want to calculate Cramer's V along with the chi-squared test. – gung - Reinstate Monica Jul 20 '15 at 12:58
  • As @gung pointed out, [Correlation between a nominal (IV) and a continuous (DV) variable](http://stats.stackexchange.com/questions/119835/correlation-between-a-nominal-iv-and-a-continuous-dv-variable/124618#124618) is an excellent link for how correlation for mixed variables can be done. `Hmisc::rcorr` does this beautifully and we can check it (for a mixed-variables dataframe) as follows: `as.data.frame(rcorr(as.matrix(data_frame), type = "pearson")$P)` and `as.data.frame(rcorr(as.matrix(data_frame), type = "pearson")$r)` – KarthikS Oct 30 '15 at 18:23
  • @gung, my teacher told me `use L, C, Lambda when Nominal vs. Nominal` but you said use `chisq.test.`? – kittygirl May 12 '19 at 02:13
  • @kittygirl, I don't know what `L, C, Lambda` are (for nominal vs nominal, or anything else). I do say to use a chi-squared test to test for an association between two nominal variables, as you say & can see above. – gung - Reinstate Monica May 12 '19 at 12:49
  • @gung,have a look at https://www.andrews.edu/~calkins/math/edrm611/edrm13.htm – kittygirl May 12 '19 at 16:20
  • For an R implementation that calculates the strength of association for nominal vs nominal with a bias-corrected Cramer's V, numeric vs numeric with Spearman (default) or Pearson correlation, and nominal vs numeric with ANOVA see https://stackoverflow.com/a/56485520/590437 – Holger Brandl Jun 07 '19 at 09:08
15

I've seen the following cheatsheet linked before:

https://stats.idre.ucla.edu/other/mult-pkg/whatstat/

It may be useful to you. It even has links to specific R libraries.

DSea
  • The issue with this cheatsheet is that it only covers categorical / ordinal / interval variables. What I'm looking for is a method that lets me use both numerical and categorical independent variables. – Clément F Jul 17 '14 at 14:01
9

If you want a correlation matrix of categorical variables, you can use the following wrapper function (requiring the 'vcd' package):

library(vcd)  # provides assocstats()

catcorrm <- function(vars, dat)
  sapply(vars, function(y)
    sapply(vars, function(x)
      assocstats(table(dat[, x], dat[, y]))$cramer))

Where:

  • vars is a character vector of the categorical variables you want to correlate
  • dat is a data.frame containing those variables

The result is a matrix of Cramer's V values.
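For example, on a hypothetical data frame (column names invented for illustration):

df <- data.frame(
  color = sample(c("red", "green", "blue"), 200, replace = TRUE),
  shape = sample(c("circle", "square"), 200, replace = TRUE)
)
catcorrm(c("color", "shape"), df)   # 2 x 2 matrix of Cramer's V values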

Dan
6

Depends on what you want to achieve. Let $X$ be the continuous, numerical variable and $K$ the (unordered) categorical variable. Then one possible approach is to assign numerical scores $t_i$ to each of the possible values of $K$, $i=1, \dots, p$. One possible criterion is to maximize the correlation between $X$ and the scores $t_i$. With only one continuous and one categorical variable, this might not be very helpful, since the maximum correlation will always be one (showing that, and finding such scores, is an exercise in using Lagrange multipliers!). With multiple variables, we try to find compromise scores for the categorical variables, perhaps by maximizing the multiple correlation $R^2$. Then the individual correlations will no longer (except in very special cases!) equal one.

Such an analysis can be seen as a generalization of multiple correspondence analysis, and is known under many names, such as canonical correlation analysis, homogeneity analysis, and many others. An implementation in R is in the homals package (on CRAN). Googling some of these names will give a wealth of information, and there is a complete book: Albert Gifi, "Nonlinear Multivariate Analysis". Good luck!
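A minimal sketch of the scoring idea with simulated data (not from the original answer; it follows the suggestion in the comments below to score each category by the mean of the continuous variable within it):

set.seed(1)
X <- rnorm(100)                                 # continuous variable
K <- sample(letters[1:4], 100, replace = TRUE)  # unordered categorical variable
scores <- tapply(X, K, mean)   # score each category by its within-category mean of X
cor(X, scores[K])              # correlation between X and the scored version of K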

kjetil b halvorsen
  • (+1) Why use Lagrange multipliers? Just use the values of the continuous variable to score the categorical one. This also reveals why the max correlation is not necessarily $1$, which is attainable only when each category is paired with an unvarying set of values of the continuous variable. – whuber Nov 17 '14 at 10:04
  • I will edit to take into account this comment. – kjetil b halvorsen Sep 09 '16 at 13:41
2

I had a similar problem and tried the chi-squared test as suggested, but I got very confused assessing the p-values against the null hypothesis.

I will explain how I handled the categorical variable; I am not sure how relevant it is in your case. I had a response variable Y and two predictor variables X1 and X2, where X2 was a categorical variable with two levels, say 1 and 2. I was trying to fit the linear model

ols = lm(Y ~ X1 + X2, data=mydata)

But I wanted to understand how different levels of X2 fit the above equation. I came across the R function by():

by(mydata, mydata$X2, function(x) summary(lm(Y ~ X1, data = x)))

This fits a separate linear model for each level of X2, giving the p-values, R-squared, and residual standard error for each fit, which I can interpret.

Again, I am not sure if this is what you want. I essentially compared how well different levels of X2 predict Y.

Sohsum
1

To measure the strength of association between two categorical variables, I would suggest using a cross-tabulation together with the chi-squared statistic.

To measure the strength of association between a numerical and a categorical variable, you can compare group means to see whether the numerical variable changes significantly from one category to another.
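A minimal sketch of both suggestions (hypothetical data frame mydata, with invented column names cat1, cat2, num):

chisq.test(table(mydata$cat1, mydata$cat2))   # categorical vs. categorical: chi-squared on a cross tab
summary(aov(num ~ cat1, data = mydata))       # numerical vs. categorical: compare means via ANOVA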
