
I have a mix of numerical and categorical predictors. For the numerical predictors, calculating the correlation is easy (Spearman, Pearson). For the categorical predictors, I know of a few measures (e.g. Cramér's V). Is there a way to calculate the correlation between numerical AND categorical variables?

I wanted to combine the two types of data sets into one big data set. Is there a way to calculate the correlation among these variables, regardless of being numerical/categorical?

cgo

2 Answers


For each categorical variable with N categories, create N-1 binary dummy variables.

https://dss.princeton.edu/online_help/analysis/dummy_variables.htm

This is also nearly the same as one-hot encoding:

https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/

You can then use your favorite regression technique to find correlations between columns, which are now numerical.
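As a rough illustration of the encoding step above, here is a minimal standard-library sketch. The toy data (`height`, `color`) and the helper names `dummy_encode` and `pearson` are hypothetical, not from any library. It drops the first level as the reference category, so a 3-level variable becomes 2 columns, and then computes Pearson correlations between all columns (for a numeric column against a 0/1 column this is the point-biserial correlation).

```python
from math import sqrt

def dummy_encode(values, drop_first=True):
    """Turn one categorical variable into 0/1 columns, one per level.

    With drop_first=True the first (sorted) level is the reference
    category, giving N-1 columns for N levels.
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in levels}

def pearson(x, y):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical toy data: one numeric and one 3-level categorical predictor.
height = [150.0, 160.0, 170.0, 180.0, 190.0, 200.0]
color = ["red", "red", "green", "green", "blue", "blue"]

# "blue" is dropped as the reference level, leaving is_green and is_red.
columns = {"height": height, **dummy_encode(color)}

# Pairwise correlation "matrix" stored as a dict keyed by column pairs.
corr = {(a, b): pearson(columns[a], columns[b]) for a in columns for b in columns}

for (a, b), r in corr.items():
    print(f"{a:>9} vs {b:<9} r = {r:+.3f}")
```

In practice, `pandas.get_dummies(df, drop_first=True)` followed by `DataFrame.corr()` does the same job with less code. Passing `drop_first=False` here keeps all N columns, which is the full one-hot set.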

user3433489
  • "dummy" (=indicator) variables and one-hot variables are not "nearly" the same, they are synonyms. One word being used in data analysis/statistics, the other - in machine learning/computer science. – ttnphns Apr 28 '20 at 18:15
  • They are binary depending on the category, with one-hot having N columns for N categories, and dummy having N-1 columns. Right? If that's the only difference, I'd call them nearly the same. – user3433489 Apr 29 '20 at 01:49
  • Dummies can be all N variables as well. We usually _use_ N-1 of them, though. It is incorrect to define dummy set as consisting strictly of N-1. – ttnphns Apr 29 '20 at 01:54
  • So in other words, dummy and one-hot are sometimes exactly the same, and usually nearly the same. Thanks. – user3433489 Apr 29 '20 at 02:02

One approach for learning about covariance (or correlation) among several variables of mixed type and with possibly non-normal distributions is to treat the data as functions of some underlying multivariate Gaussian random variable.

If a categorical variable is dichotomous, you can encode it as a single binary indicator variable. If a categorical variable is not ordinal and has, say, N levels, you will need to expand it into a set of N dummy variables, a procedure sometimes known as one-hot encoding.

Suppose you begin with a two-dimensional data set in which the first variable is continuous and numeric and the second is a 3-level nominal categorical variable. After expanding the categorical variable into 3 binary indicator vectors, your data will be 4-dimensional, and you can treat the data as a function of some underlying 4-dimensional Gaussian random variable. The covariance structure of the underlying Gaussian distribution will characterize the relationship among all the variables in your data.
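To make "a function of some underlying Gaussian random variable" concrete, one common way to write a Gaussian copula model is sketched below. The notation ($z_i$, $C$, $g_j$) is my own shorthand rather than anything fixed by the question:

$$
z_i = (z_{i1}, \dots, z_{ip}) \sim \mathcal{N}_p(0, C), \qquad y_{ij} = g_j(z_{ij}), \quad j = 1, \dots, p,
$$

where $y_{ij}$ is the observed value of variable $j$ for observation $i$, each $g_j$ is an unknown nondecreasing function, and $C$ is a $p \times p$ correlation matrix. In the example above, $p = 4$. For a continuous variable, $g_j$ is a smooth monotone transformation; for a binary indicator, it is a step function that thresholds the latent $z_{ij}$. The dependence among the observed variables is then summarized by $C$.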

The R package "MCMCpack" includes functions for fitting Gaussian copula models.

Hoff (2007), "Extending the rank likelihood for semiparametric copula estimation" (https://projecteuclid.org/euclid.aoas/1183143739), may be useful to you. It describes a semiparametric Gaussian copula model that accommodates mixed continuous and discrete ordinal data. Muthén (1983), "Latent variable structural equation modeling with categorical data" (https://www.sciencedirect.com/science/article/pii/0304407683900933), and Quinn (2004), "Bayesian Factor Analysis for Mixed Ordinal and Continuous Responses" (https://www.law.berkeley.edu/files/pa04.pdf), may also provide insights for this problem.

David Buch