I do a text analysis where I want to identify dependencies among categorical variables, for example let's take this dataset:

  pos1 pos2 pos3
1    A    B    C
2    A    B    D
3    A    B    A
4    B    B    D
5    A    B    B

Here the columns indicate the position in the text and the rows indicate different texts. From this example it is obvious that A at position 1 is accompanied by B at position 2. I thought of calculating a correlation coefficient, such as Pearson's; however, to do so I would have to convert this data set to a binary matrix. Considering this question, I think neither Pearson nor Spearman would be a good choice. Is there a way to calculate the association between these categorical variables, so that one can see, for example, that A at position 1 is commonly accompanied by B at position 2?

CodeNoob

1 Answer

If both variables are binary, you can use Fisher's exact test, which tests whether the variables are independent. This is how it works in R:

# Data
set.seed(1)
n <- 1000
data <- data.frame(pos1 = c(rep("A", n/2), rep("B", n/2)),
                   pos2 = sample(c("A", "B"), n, replace= TRUE))

# you must provide your data as a frequency table
data.table <- table(data)
data.table
#>     pos2
#> pos1   A   B
#>    A 270 230
#>    B 250 250

# run the test
fisher.test(data.table)
#> data:  data.table
#> p-value = 0.2291

In this case the p-value is > .05, so you cannot reject the null hypothesis and you cannot claim that there is an association. (As always, a non-significant result does not tell you that the null hypothesis is true, i.e. that the variables are independent.)
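If you want the numbers rather than the printed summary, the object returned by fisher.test() can be queried directly. The following is a minimal sketch, assuming the counts are copied by hand from the table shown above; the names freq and result are only illustrative.

```r
# Rebuild the 2x2 frequency table from the counts printed above
# (copied by hand; freq and result are illustrative names).
freq <- matrix(c(270, 230,
                 250, 250),
               nrow = 2, byrow = TRUE,
               dimnames = list(pos1 = c("A", "B"), pos2 = c("A", "B")))

result <- fisher.test(freq)

result$p.value    # p-value of the test of independence
result$estimate   # conditional maximum-likelihood estimate of the odds ratio
result$conf.int   # 95% confidence interval for the odds ratio
```

For a 2x2 table, an odds ratio close to 1 together with a confidence interval that includes 1 points in the same direction as the non-significant p-value.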

Although you ask about the "correlation" between the variables pos1 and pos2, your data also contain a column called pos3, which has more than two groups. To test the independence of two categorical variables that each have two or more groups, you can use the chi-square test of independence. The R code is very similar: all you need to do is pass a frequency table to chisq.test():

# run the test
chisq.test(some.data.table)
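Since some.data.table is only a placeholder, here is a minimal runnable sketch of the same call. It assumes simulated data (not your texts) in which pos1 has two levels and pos3 has four, drawn independently of each other; sim, freq and result are illustrative names.

```r
# Simulated data (an assumption for illustration, not the asker's texts):
# pos1 has two levels, pos3 has four, and the two are drawn independently.
set.seed(1)
n <- 1000
sim <- data.frame(pos1 = sample(c("A", "B"), n, replace = TRUE),
                  pos3 = sample(c("A", "B", "C", "D"), n, replace = TRUE))

freq <- table(sim$pos1, sim$pos3)   # 2 x 4 frequency table

result <- chisq.test(freq)
result$p.value     # p-value of the test of independence
result$parameter   # degrees of freedom: (2 - 1) * (4 - 1) = 3
result$expected    # expected counts under independence

# The chi-square approximation is usually considered unreliable when
# expected counts are small (a common rule of thumb is below 5).
all(result$expected >= 5)
```

With only a handful of texts, as in the five-row example from the question, the expected counts will be far too small for this approximation, and chisq.test() will warn that it may be incorrect.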