I've been doing some multivariate analysis on a dataset that contains mostly categorical data. For example, two of my variables are:
- gender (M or F)
- state (A, B or C)
and each observation is a person.
With the objective of finding correlation between gender and state, my first naive attempt was to create dummy variables that act as indicators. Hence, I created 5 columns: $gender_M$, $gender_F$, $state_A$, $state_B$, $state_C$. To illustrate, if an entry was:
\begin{array} {|r|r|r|} \hline person & gender & state \\ \hline rick & M & B \\ \hline \end{array}
it became:
\begin{array} {|r|r|r|r|r|r|} \hline person & gender_M & gender_F & state_A & state_B & state_C \\ \hline rick & 1 & 0 & 0 & 1 & 0 \\ \hline \end{array}
To my understanding, I'm creating 5 Bernoulli random variables (one for each new column). Hence, I can now apply traditional methods to calculate expectation, covariance, correlation, etc.
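For concreteness, the encoding and correlation step can be sketched in plain Python as below. The six sample rows and the helper names `one_hot` and `pearson` are hypothetical, made up purely for illustration; this is a sketch under those assumptions, not the real dataset.

```python
from math import sqrt

# Hypothetical toy data: (gender, state) pairs, invented for illustration.
people = [("M", "B"), ("F", "A"), ("M", "C"), ("F", "B"), ("M", "B"), ("F", "A")]

def one_hot(values):
    """Return {level: list of 0/1 indicators}, one list per category level."""
    levels = sorted(set(values))
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

def pearson(x, y):
    """Population Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)

gender = one_hot([g for g, _ in people])  # keys "F", "M"
state = one_hot([s for _, s in people])   # keys "A", "B", "C"

# Indicators of the same two-level variable are complementary,
# so their correlation is exactly -1.
r_m_f = pearson(gender["M"], gender["F"])

# Correlation between one gender indicator and one state indicator.
r_m_b = pearson(gender["M"], state["B"])

print(r_m_f, r_m_b)
```

Note that this yields one coefficient per pair of indicator columns (here 2 x 3 gender-by-state pairs), not a single number summarizing the association between gender and state.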
My question is: is this approach wrong in some sense? The model looks OK to me. Am I missing something that would allow me to extract better insights from my data? Is it wrong to expect Pearson's correlation to give me good results (the literature says to use Cramér's V)?
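For comparison, here is a minimal sketch of Cramér's V computed directly from the raw categories, using the standard definition $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ with no bias or continuity correction. The function name `cramers_v` and the sample data are hypothetical, and the code assumes each variable has at least two observed levels.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Uncorrected Cramér's V for two equal-length lists of category labels.

    Assumes each variable takes at least two distinct values.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell counts
    row = Counter(xs)             # marginal counts of the first variable
    col = Counter(ys)             # marginal counts of the second variable

    # Pearson chi-square statistic over every cell of the contingency table.
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            observed = joint[(r, c)]  # Counter gives 0 for empty cells
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col))
    return sqrt(chi2 / (n * (k - 1)))

# Hypothetical toy data, invented for illustration.
genders = ["M", "F", "M", "F", "M", "F"]
states = ["B", "A", "C", "B", "B", "A"]

v = cramers_v(genders, states)
print(round(v, 4))  # 0.7454
```

Unlike the pairwise indicator correlations, this gives a single value in $[0, 1]$ for the whole gender-by-state table. When both variables have exactly two levels, it reduces to the absolute value of the Pearson correlation between their indicators.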
Thanks!