
I have two samples from two different language corpora: sample 1 contains 82 verbs and sample 2 contains 89 verbs. I want to compare the frequency of a particular verb type, let's call them oral verbs, across the two samples and see whether it differs significantly. (As a comparison group for a 4-cell chi-square test, I would have used another verb type for which I don't expect a difference.) Originally I wanted to do a chi-square test, but then I realized that wouldn't be possible given the different sample sizes. Which test could I apply instead? Thank you!

Soles

2 Answers


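You can use a chi-squared test in your example even though the sample sizes differ: the test compares the proportion of oral verbs in each sample, so the row totals do not need to be equal. Your "another verb type" would be the verbs that are not oral verbs, i.e. all the other verbs.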

Suppose in your example, $10$ of the $82$ verbs in sample one were oral verbs and $72$ were not, while $20$ of the $89$ verbs in sample two were oral verbs and $69$ were not. Then the table for your four cell chi-squared test could look like

            oral  other | total
sample 1      10     72 |    82
sample 2      20     69 |    89
            ----  ----- | -----
total         30    141 |   171

and in R you might get

chisq.test(rbind(c(10, 72), c(20, 69)))

#     Pearson's Chi-squared test with Yates' continuity correction
#
# data:  rbind(c(10, 72), c(20, 69))
# X-squared = 2.4459, df = 1, p-value = 0.1178

so the difference in this example would not be statistically significant at the conventional 5% level.
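For anyone who wants to see where that statistic comes from, here is a minimal sketch in plain Python that recomputes it by hand from the same made-up counts. It assumes the usual Yates correction (each |observed - expected| reduced by 0.5, never below zero) and relies on the fact that, with 1 degree of freedom, the upper tail of the chi-squared distribution can be written with `math.erfc`. The variable names are my own, not anything from R.

```python
import math

# Observed counts from the example above: rows are the two samples,
# columns are (oral verbs, other verbs).
observed = [[10, 72], [20, 69]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if the proportion of oral verbs were the same
        # in both samples.
        expected = row_totals[i] * col_totals[j] / n
        # Yates' correction: shrink |observed - expected| by 0.5
        # (never below zero) before squaring.
        diff = max(abs(obs - expected) - 0.5, 0.0)
        chi2 += diff ** 2 / expected

# With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 4), round(p_value, 4))  # prints 2.4459 0.1178
```

The expected counts are built from the row and column totals, which is why unequal sample sizes (82 vs 89) cause no problem.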

Henry
  • Thanks Henry, that was quick and super useful! – Soles Apr 26 '19 at 08:34
  • Thank you for this good answer. I am wondering why the Yates continuity correction is applied (I checked the [docs](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/chisq.test) and gather that it is the default in the 2x2 case). However, I don't really understand why; I thought the Yates correction was only necessary when the total N is smaller than 40? – Björn Oct 12 '20 at 12:02
  • @BjörnB - if a continuity correction is sensible for small samples then it is also sensible for large samples; it will then only make a small difference and so used to be ignored for hand calculations, but that is not a good reason now we use computers – Henry Oct 14 '20 at 14:58
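To illustrate the point in the comment above, here is a rough sketch in plain Python comparing the statistic with and without the correction, first on the example table and then on a made-up table with the same proportions but ten times the counts. `chi2_2x2` and `relative_drop` are hypothetical helpers written for this illustration, not library functions.

```python
def chi2_2x2(table, yates):
    """Pearson chi-squared statistic for a 2x2 table (illustrative helper)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(obs - expected)
            if yates:
                # Continuity correction: reduce the gap by 0.5, not below zero.
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat


def relative_drop(table):
    """Fraction by which the correction lowers the uncorrected statistic."""
    plain = chi2_2x2(table, yates=False)
    corrected = chi2_2x2(table, yates=True)
    return (plain - corrected) / plain


small = [[10, 72], [20, 69]]                      # example from the answer
large = [[10 * x for x in row] for row in small]  # assumed: same proportions, 10x counts

print(relative_drop(small), relative_drop(large))
```

On the original table the correction lowers the statistic by roughly a fifth; on the scaled-up table the reduction is only a few percent, which is the "small difference" for large samples.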

Just in case anyone is looking for the Python version of this, you can use scipy's chi2_contingency: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html

Using the same example as @Henry:

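import numpy as np
from scipy.stats import chi2_contingency

# Rows are the two samples; columns are (oral verbs, other verbs).
obs = np.array([[10, 72], [20, 69]])

# Returns the statistic, the p-value, the degrees of freedom and the
# table of expected counts. For a 2x2 table, Yates' continuity
# correction is applied by default (correction=True).
chi2, p, dof, ex = chi2_contingency(obs)
print(chi2, dof, p)
# prints: 2.44591778277931 1 0.11783094937852609

This is the same result as R's chisq.test.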

Vincent