I've been doing some multivariate analysis on a dataset that consists mostly of categorical data. For example, two of the variables are:

  1. gender (M or F)
  2. state (A, B or C)

and each observation is a person.

With the objective of finding correlation between gender and state, my first naive attempt was to create dummy variables that act as indicators. Hence, I created 5 columns: $gender_M$, $gender_F$, $state_A$, $state_B$, $state_C$. To illustrate, if an entry was:

\begin{array} {|l|c|c|} \hline person & gender & state \\ \hline rick & M & B \\ \hline \end{array}

it became:

\begin{array} {|l|c|c|c|c|c|} \hline person & gender_M & gender_F & state_A & state_B & state_C \\ \hline rick & 1 & 0 & 0 & 1 & 0 \\ \hline \end{array}
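The encoding step above can be sketched in plain Python. This is only a minimal illustration, not a definitive implementation: the helper name `one_hot` and the extra rows for `ann` and `joe` are hypothetical and were added just so that every level appears at least once.

```python
def one_hot(rows, columns):
    """Expand each categorical column into 0/1 indicator columns."""
    # Collect the observed levels of every categorical column.
    levels = {col: sorted({row[col] for row in rows}) for col in columns}
    encoded = []
    for row in rows:
        out = {}
        for col in columns:
            for level in levels[col]:
                # Indicator is 1 only for the level this row actually takes.
                out[f"{col}_{level}"] = 1 if row[col] == level else 0
        encoded.append(out)
    return encoded


# "rick" comes from the example above; "ann" and "joe" are made-up rows.
people = [
    {"person": "rick", "gender": "M", "state": "B"},
    {"person": "ann", "gender": "F", "state": "A"},
    {"person": "joe", "gender": "M", "state": "C"},
]
encoded = one_hot(people, ["gender", "state"])
print(encoded[0])
```

Note that within each original variable exactly one indicator is 1 per row, so the indicator columns of the same variable are not free to vary independently.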

To my understanding, I'm creating 5 Bernoulli random variables (one for each new column). Hence, I can now apply traditional methods to calculate expectation, covariance, correlation, etc.
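As a rough worked example of what those traditional methods give, assume (this is my assumption, not something established above) that each person's state is an independent draw with probabilities $p_A$, $p_B$, $p_C$ summing to 1. Writing $X_k$ for the indicator $state_k$, and using the fact that $X_A X_B = 0$ because a person is in only one state:

$$
\begin{aligned}
E[X_k] &= p_k, \qquad \operatorname{Var}(X_k) = p_k (1 - p_k), \\
\operatorname{Cov}(X_A, X_B) &= E[X_A X_B] - E[X_A]\,E[X_B] = -\,p_A\, p_B, \\
\operatorname{Corr}(X_A, X_B) &= -\sqrt{\frac{p_A\, p_B}{(1 - p_A)(1 - p_B)}}.
\end{aligned}
$$

So indicators built from the same variable are always negatively correlated, and for gender, where $gender_F = 1 - gender_M$, the correlation is exactly $-1$. Those within-variable entries of the correlation matrix only reflect the encoding, not the data.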

The thing is: is this wrong in some sense? The model looks OK to me. Am I missing out on something that would allow me to extract better insights from my data? Is it wrong to expect Pearson's correlation to give me good results (the literature says to use Cramér's V)?
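To make the comparison concrete, here is a small sketch in plain Python that computes Pearson's correlation between two indicator columns and Cramér's V between the original categorical variables. The data are made up purely for illustration, and the helper names `pearson` and `cramers_v` are my own. Cramér's V here is the uncorrected version, $\sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def cramers_v(x, y):
    """Uncorrected Cramér's V; assumes both variables take at least 2 values."""
    n = len(x)
    joint, row, col = Counter(zip(x, y)), Counter(x), Counter(y)
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    return math.sqrt(chi2 / (n * k))


# Made-up observations, one entry per person.
gender = ["M", "M", "M", "F", "F", "F", "M", "F"]
state = ["A", "A", "B", "B", "C", "C", "A", "C"]

gender_M = [1 if g == "M" else 0 for g in gender]
state_A = [1 if s == "A" else 0 for s in state]

r_pair = pearson(gender_M, state_A)      # one pair of indicator columns
v_pair = cramers_v(gender_M, state_A)    # same 2x2 table
v_full = cramers_v(gender, state)        # the full 2x3 table
print(r_pair, v_pair, v_full)
```

For a single pair of indicators, the absolute value of Pearson's correlation (the phi coefficient) coincides with Cramér's V on the same 2x2 table. The difference is that each pairwise correlation describes only one cell pattern (for example M versus A), whereas Cramér's V on the original variables summarizes the association across all levels in a single number.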

Thanks!

  • Following your example, it seems to me that your question is about the correlation between two categorical variables, and transforming these variables is just a step to accomplish this (so in this sense, the title is not very informative). As you can see in [this answer](https://stats.stackexchange.com/a/112674/109647), if you are looking for a test of significance of the association between these two categorical (nominal) variables and a measure of strength of their association, chi-squared test and Cramer’s V (as you mentioned at the end) might be enough. – T.E.G. Jun 14 '17 at 04:13
  • They're *binary* variables, not Bernoulli. Bernoulli relates to a probability distribution, while these needn't be random. What are you trying to find out? – Glen_b Jun 14 '17 at 09:46
  • @Glen_b right, I phrased it wrong. My thought was: I can model my dataset as a multivariate Bernoulli R.V. – Daniel Severo Jun 14 '17 at 12:51
  • @T.E.G. my actual question was about thinking of my transformed dataset as a multivariate Bernoulli variable. The correlation talk was just to illustrate the results I could obtain if my thoughts were correct. – Daniel Severo Jun 14 '17 at 12:53
