I have counts of occurrences of two types of words (A and B) in several texts. What I would like to test is whether the frequencies of occurrence of the two types of words across texts are 'correlated'. However, using Pearson's correlation is probably not correct, because my data are not continuous, and in addition the counts are often quite low (sometimes zero). What is a good way to test my hypothesis?
-
[Spearman's correlation](http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) seems appropriate? – Jul 11 '14 at 15:37
-
@Matthew, I suppose I'd need to analyze proportions then, in order to factor out the text length. (The texts differ in length.) Is Spearman's correlation still a good choice for proportions? – Pavel Jul 11 '14 at 15:58
-
Consider the chi-square test for independence. http://stattrek.com/chi-square-test/independence.aspx – TrynnaDoStat Jul 11 '14 at 16:01
-
@TrynnaDoStat the $\chi^{2}$ would be inappropriate because it ignores the *ordering* of count data, right? – Alexis Jul 11 '14 at 17:01
-
@Alexis if I'm reading the question correctly, the order of the data is not of interest here. If ordering is of interest, then you are correct the chi-square test is inappropriate. – TrynnaDoStat Jul 11 '14 at 17:35
-
@TrynnaDoStat ordering is critical in correlation between ordered variables. – Alexis Jul 11 '14 at 17:43
-
@Ben This thread is from 2014. :) – Pavel Jan 16 '21 at 14:06
1 Answer
@Matthew has answered your question: Spearman's $\rho$ will give you a measure of monotonic association between your variables. You can also perform inference on whether this correlation is, for example, different from zero using a straightforward t test.
To calculate $\boldsymbol{r}_{\textbf{S}}$ (assuming no ties):
- Rank each of your variables independently.
- Calculate the difference, $d_{i}$, between ranks for each observation/text (I am assuming from your question that the measures are paired: so there's a count from text $A$ and a different count from text $B$, across $n$ texts).
- $r_{\text{S}} = 1 - \frac{6\sum_{i=1}^{n}{d_{i}^{2}}}{n\left(n^{2}-1\right)}$
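The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts are made up, the helper names `rank_no_ties` and `spearman_no_ties` are hypothetical, and the code assumes there are no tied values within either variable.

```python
def rank_no_ties(values):
    """Return 1-based ranks; assumes all values are distinct (no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_no_ties(x, y):
    """Spearman's r_S via 1 - 6*sum(d_i^2) / (n*(n^2 - 1))."""
    n = len(x)
    rank_x = rank_no_ties(x)
    rank_y = rank_no_ties(y)
    # d_i is the difference between the two ranks of text i
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical paired counts for five texts (illustrative data only)
counts_a = [3, 0, 7, 12, 5]
counts_b = [1, 2, 9, 8, 4]

r_s = spearman_no_ties(counts_a, counts_b)
print(round(r_s, 4))  # 0.8
```

Here the rank differences are 1, -1, -1, 1, 0, so $\sum d_i^2 = 4$ and $r_{\text{S}} = 1 - 24/120 = 0.8$.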
The calculation for $\mathbf{r}_{\textbf{S}}$ (regardless of ties):
Rank each of your variables independently.
Calculations proceed as for Pearson's $r$ but using the ranked values ($r_A$ and $r_B$) of the before and after (or matched) observations:
$r_{\text{S}} = \frac{\sum_{i=1}^{n}{\frac{r_{A,i} - \overline{r}_{A}}{s_{r_A}} \times \frac{r_{B,i} - \overline{r}_{B}}{s_{r_B}}}}{n-1}$
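With low counts and zeros, ties are likely, so the tie-robust version is probably the one to use in practice. A minimal sketch, assuming tied values receive the average of the ranks they span (the usual convention); the data and the function names `average_ranks`, `pearson`, and `spearman` are hypothetical:

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson's r as the sum of products of z-scores divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Note: undefined (division by zero) if either variable is constant
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / (n - 1)


def spearman(x, y):
    """Spearman's r_S as Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical counts with zeros and ties (illustrative data only)
counts_a = [0, 0, 2, 5, 3]
counts_b = [1, 0, 0, 4, 2]
r_s_ties = spearman(counts_a, counts_b)
```

When no ties are present, this gives the same value as the $d_i$ formula, so it can be used in both cases.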
To test for evidence $\mathbf{r_{\textbf{S}} \ne 0}$:
$\text{H}_{0}\text{: }r_{\text{S}} = 0$, $\text{H}_{\text{A}}\text{: }r_{\text{S}} \ne 0$
$t = r_{\text{S}}\sqrt{\frac{n-2}{1-r^{2}_{\text{S}}}}$
Base your rejection decision for $\text{H}_{0}$ on the t distribution, with $n-2$ degrees of freedom.
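As a rough sketch of the test statistic, the snippet below plugs a hypothetical $r_{\text{S}} = 0.8$ from $n = 5$ texts into the formula above. The function name `spearman_t` is an assumption for illustration. The p-value itself is not computed here; it would come from a t table or from a statistics library.

```python
import math


def spearman_t(r_s, n):
    """Return (t, df) for H0: r_S = 0, using t = r_S * sqrt((n - 2) / (1 - r_S^2))."""
    if n < 3:
        raise ValueError("need at least 3 paired observations")
    if abs(r_s) >= 1:
        raise ValueError("t is undefined when |r_S| = 1")
    df = n - 2
    t = r_s * math.sqrt(df / (1 - r_s ** 2))
    return t, df


# Hypothetical values: r_S = 0.8 observed across n = 5 texts
t_stat, df = spearman_t(0.8, 5)
print(round(t_stat, 3), df)  # 2.309 3
```

For a two-sided test at $\alpha = 0.05$ with 3 degrees of freedom the critical value is about 3.182, so in this illustrative case $\text{H}_{0}$ would not be rejected. With so few texts the t approximation is crude, so treat a borderline result with caution.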
Pagano, M., & Gauvreau, K. (2000). Principles of Biostatistics (2nd ed.). Duxbury Press.

-
Actually, there are two counts from each text. One count for one type of word (type A), and another for another type of word (type B). – Pavel Jul 11 '14 at 22:06
-
Thanks, @Alexis. I realize that I didn't ask the right question. My hypothesis is not actually about a correlation of counts as such. Because the texts are of different lengths, there will be a correlation of counts even if there is no actual correlation of probabilities of occurrence. So I guess I really need to know whether there is a correlation between two ratios, i.e. (number of times word type A appeared)/(total number of words) and the same kind of ratio for word type B. Is Spearman's rho applicable to ratios? – Pavel Jul 11 '14 at 22:10
-
In response to your first comment: yes, precisely, you have **paired measurements** on each text. – Alexis Jul 11 '14 at 22:49
-
In response to your second comment, this is an oxymoron: "there will be a correlation of counts even if there is no actual correlation of probabilities of occurrence." However, I *think* you might mean that you are interested in correlations between zero-inflated counts? – Alexis Jul 11 '14 at 22:51
-
I'm not quite sure why it is an oxymoron. What I mean is this: let's assume that I have three texts and that the probabilities of occurrence of word A in these texts are 0.3, 0.1, 0.3 and those for word B are 0.1, 0.2, and 0.3. The probabilities are not correlated. Let's assume that the number of words is 100 for the first text, 200 for the second text, and 300 for the third text. Thus the actual counts will be 30, 20, 90 and 10, 40, 90 respectively. These counts have a Spearman correlation of 0.5. Am I missing something? – Pavel Jul 11 '14 at 23:10
-
What I mean by that is that if the probability of occurrence of event A is high, so is the probability of event B, and vice versa. (For positive correlations.) – Pavel Jul 11 '14 at 23:14