What is the best (simplest and most robust) test statistic to measure the overall degree of association (inter-dependence, correlation or covariance?) between multiple binary variables?

I have been looking at multiple regression, but I think this is too complex as it is used to model the actual relationship for prediction, rather than to measure the degree of correlation.

So let's say we have k binary (binomial) variables, and a sample size of n observations per variable, where each variable occurs (positive case) at a given frequency/probability f.

How would we measure the degree of correlation between these variables, and how does the p-value of that metric depend on n, k and f?

Kelvin
  • Why not just calculate the pairwise sample correlations directly? – dsaxton Mar 24 '16 at 14:17
  • Because that wouldn't give a single, overall correlation metric, which is what I would like to do probability analysis on to test for significance. – Kelvin Mar 24 '16 at 14:19
  • http://stats.stackexchange.com/questions/103801/is-it-meaningful-to-calculate-pearson-or-spearman-correlation-between-two-boolea may help. Think geometrically: coding with 0 and 1 there are four points (0, 0), (0, 1), (1, 0) and (1, 1) to bivariate data. A correlation makes sense so long as both variables are not constant. Whether it is the best method for your purpose is a different question. – Nick Cox Mar 24 '16 at 14:22
  • If one binary variable is a response and you want to think of others as predictors, start with logit, not linear regression. – Nick Cox Mar 24 '16 at 14:24
  • Hi Nick, yes, I have been looking at logistic regression, but I think it's too complicated for what I need, which is just a simple correlation test statistic so that I can do p-value analysis (no need to model the actual correlation). – Kelvin Mar 24 '16 at 14:26
  • If you really want a P-value, I suggest that you use Fisher's test. – Nick Cox Mar 24 '16 at 14:29
  • How would the p-value depend on n, k and f in Fisher's test? – Kelvin Mar 24 '16 at 14:34
  • Are you saying that you want a **single** measure summarizing all the bivariate relationships among several variables **simultaneously**? I can't see much meaning to that if so. Perhaps the road leads to some flavour of correspondence analysis. – Nick Cox Mar 24 '16 at 14:45
  • In essence, yes. I guess one could think of it as the overall degree of connectivity in a network of variables, so it does have some meaning. – Kelvin Mar 24 '16 at 14:49
  • I think some people use principal component analysis (PCA) on binary variables. I think that divides the experts on whether it is sound. There may be some threads here. – Nick Cox Mar 24 '16 at 15:00

2 Answers

First, whatever you use, it won't be correlation. Correlation is about two variables.

Second, there is no simple way to do this because "degree of association" is not easily defined with multiple variables.

Third, as @NickCox commented yesterday, some people do principal components analysis on binary data, but (1) this isn't simple, (2) it's a bit controversial, and (3) it may not give you what you want.

Fourth, have you considered log-linear analysis? This is a sort of generalization of the chi-square test to tables with more than two variables, and it makes no assumption about which variable is the dependent one.
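
To give a rough idea of what the simplest log-linear analysis looks like here, the sketch below fits the mutual-independence model (main effects only) to k binary variables and returns its likelihood-ratio statistic G² with its degrees of freedom, 2^k − 1 − k. The function name `mutual_independence_g2` and the input layout (one tuple of 0/1 values per observation) are my own assumptions for illustration, and this is only one of many possible log-linear models, not a definitive recipe.

```python
from collections import Counter
from itertools import product
from math import log


def mutual_independence_g2(rows):
    """Likelihood-ratio statistic for mutual independence of k binary variables.

    rows: list of equal-length tuples of 0/1 values, one tuple per observation
    (hypothetical input layout chosen for this sketch).
    Returns (G^2, degrees of freedom).
    """
    n = len(rows)
    k = len(rows[0])
    # Marginal frequency of a positive case for each variable.
    p = [sum(r[j] for r in rows) / n for j in range(k)]
    observed = Counter(tuple(r) for r in rows)

    g2 = 0.0
    for cell in product((0, 1), repeat=k):
        o = observed.get(cell, 0)
        if o == 0:
            continue  # empty cells contribute nothing to G^2
        # Expected count under independence: n times the product of marginals.
        e = float(n)
        for j, x in enumerate(cell):
            e *= p[j] if x == 1 else 1.0 - p[j]
        g2 += 2.0 * o * log(o / e)

    # 2^k cells, minus 1 for the total, minus k fitted marginal frequencies.
    df = 2 ** k - 1 - k
    return g2, df
```

Under the independence hypothesis G² is approximately chi-square distributed with `df` degrees of freedom when expected counts are not too small, so a p-value can be read from the upper tail of that distribution (for example with `scipy.stats.chi2.sf(g2, df)` if SciPy is available). With many variables or rare positives, a lot of the 2^k cells will have tiny expected counts and the approximation becomes unreliable.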

Peter Flom
  • Thanks, I will look into log-linear analysis. Do you have any references (links) on how it could be applied in this case? Otherwise I'll just have to feel and bump my way around as I've been doing... – Kelvin Mar 26 '16 at 12:58
  • I don't have any particularly good links; Googling will find lots of stuff. – Peter Flom Mar 26 '16 at 13:07

That would be Cramér's V, a measure of association between two nominal variables: https://en.m.wikipedia.org/wiki/Cram%C3%A9r%27s_V

The Pearson chi-squared test is the way to determine if the dependency is significant: https://en.m.wikipedia.org/wiki/Pearson%27s_chi-squared_test
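
As a minimal sketch for one pair of binary variables, the function below computes the Pearson chi-squared statistic (without continuity correction) from the 2×2 table and turns it into Cramér's V, which for a 2×2 table is sqrt(chi²/n) and equals the absolute value of the phi coefficient. The name `cramers_v_2x2` and the 0/1 list inputs are assumptions made for this example.

```python
from math import sqrt


def cramers_v_2x2(x, y):
    """Pearson chi-squared statistic and Cramér's V for two 0/1 sequences.

    x, y: equal-length sequences of 0/1 values (hypothetical input layout).
    Returns (chi2, v).
    """
    # Cell counts of the 2x2 contingency table.
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    n = a + b + c + d

    # Product of the four marginal totals; zero if either variable is constant.
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("association is undefined when a variable is constant")

    chi2 = n * (a * d - b * c) ** 2 / denom
    # For a 2x2 table min(rows - 1, cols - 1) = 1, so V = sqrt(chi2 / n).
    v = sqrt(chi2 / n)
    return chi2, v
```

The chi-squared statistic has 1 degree of freedom here. Note that this covers a single pair of variables only.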

Pieter
  • Thanks, Pieter. Is there a way to calculate the minimum sample size, given that some variables may occur with very low frequency (and so may not appear at all if the sample is too small)? Also, how does it work for multiple binomial variables? – Kelvin Mar 24 '16 at 14:39
  • You mean a power computation? Or just the chance that a variable has at least one positive given the sample size and the probability of one item being positive? – Pieter Mar 24 '16 at 14:43
  • Yes, I mean a power calculation. Ideally, I am looking to understand how the minimum sample size for such a correlation test depends on the number of binomial variables k, their frequency f, and any other critical parameters (e.g., alpha, beta, R^2, etc.). – Kelvin Mar 24 '16 at 14:45
  • After some research, I don't think Cramer's V can be used to measure correlation between multiple (k>2) binomial variables, can it? It seems to be geared only for 2 variables, in 2-dimensional contingency tables... – Kelvin Mar 26 '16 at 12:08
  • Yes, that's correct. I did not notice that. It measures the dependency between 2 nominal variables. – Pieter Mar 26 '16 at 12:11
  • Thanks. So is there a way to measure overall interdependency between more than 2 binomial variables? – Kelvin Mar 26 '16 at 12:12