I'm confused about the way chisq.test() is handling a very simple table from an exercise where I asked students to "sample" a bag of M&Ms and compare their scope to the pooled samples from their lab mates, to wit:

> mm1<-data.frame(
+     color = c("red", "orange", "yellow", "green", "blue", "brown"),
+     mySample = c(9,10,9,15,13,7),
+     myTable = c(26,28,34,44,44,18))
> chisq.test(mm1$mySample,mm1$myTable)

    Pearson's Chi-squared test

data:  mm1$mySample and mm1$myTable
X-squared = 18, df = 16, p-value = 0.3239

Warning message:
In chisq.test(mm1$mySample, mm1$myTable) :
  Chi-squared approximation may be incorrect
> chisq.test(mm1[,2:3])

    Pearson's Chi-squared test

data:  mm1[, 2:3]
X-squared = 0.67269, df = 5, p-value = 0.9844

The second version is what I'd expect (5 d.f.), but why is the first one failing, and where is it coming up with 16 d.f.? I don't think any of my expected values should be less than 5 for this set, so I don't think that warning reflects the small-cell issue that's discussed here. Both is.atomic() and is.numeric() evaluate to TRUE for mm1$mySample and mm1$myTable, so what's the difference between passing them as an X and Y and passing them as a matrix?

D1785
  • The second thing one does in `R`, after trying to read the help page (which often is terse and opaque), is to examine the code. The key line is `x – whuber Sep 28 '21 at 20:28
  • This is flagged as off-topic. But this is not really a Q about how to use R (which *is* off-topic), but about understanding what it does, which is on-topic – kjetil b halvorsen Sep 28 '21 at 22:54
  • "to the pooled samples from their lab makes" (presumably you mean *lab mates* there) -- I hope those pooled samples exclude their own data. – Glen_b Sep 28 '21 at 23:32
  • Link following @whuber's comment for various ways to check source code in R: https://stackoverflow.com/q/19226816/1834244 (several different options offered over the different answers) – James Stanley Sep 29 '21 at 19:45

1 Answer


Short answer: the second version works because it treats the input data as a contingency table.

From help(chisq.test):

If x is a matrix with at least two rows and columns, it is taken as a two-dimensional contingency table: the entries of x must be non-negative integers.
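As a quick check of the matrix case, here is a minimal sketch. The names `counts` and `res_matrix` are introduced purely for illustration. It builds the 6 x 2 table of counts explicitly and confirms that the degrees of freedom are (rows - 1) x (columns - 1).

```r
mm1 <- data.frame(
  color = c("red", "orange", "yellow", "green", "blue", "brown"),
  mySample = c(9, 10, 9, 15, 13, 7),
  myTable = c(26, 28, 34, 44, 44, 18))

# `counts` is an illustrative name: the two count columns as a 6 x 2 matrix
counts <- as.matrix(mm1[, 2:3])
rownames(counts) <- mm1$color

res_matrix <- chisq.test(counts)

dim(counts)            # rows are colours, columns are the two sets of counts
res_matrix$parameter   # degrees of freedom: (6 - 1) * (2 - 1)
```

Each cell here is a count of M&Ms, which is what a contingency table is meant to hold.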

The "two vector" version (your first bit of code) treats the input differently from what you expect. It treats each row as a single observation: there are six observations, taking five distinct values in mySample (7, 9, 10, 13, and 15) and five distinct values in myTable (18, 26, 28, 34, and 44), and it cross-tabulates the first vector against the second.

The help file is a little more opaque about this case; the relevant part is the coercion to factors:

Otherwise, x and y must be vectors or factors of the same length; cases with missing values are removed, the objects are coerced to factors, and the contingency table is computed from these
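To make that sentence concrete, the sketch below does the coercion and cross-tabulation by hand. It follows the description in the help text and is not a copy of the function's internals; `xtab` is an illustrative name.

```r
mySample <- c(9, 10, 9, 15, 13, 7)
myTable  <- c(26, 28, 34, 44, 44, 18)

# Coerce both vectors to factors and cross-tabulate them,
# as the help text describes for the two-vector case
xtab <- table(factor(mySample), factor(myTable))

dim(xtab)   # one row per distinct mySample value, one column per distinct myTable value
sum(xtab)   # each of the six data rows contributes a single count
```

The counts themselves have become category labels, so the table records how often each pair of values occurs rather than how many M&Ms there are.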

You can see how these data are being treated in the output below:

Code example of what the two-vector analysis is doing:

mm1 <- data.frame(
         color = c("red", "orange", "yellow", "green", "blue", "brown"),
         mySample = c(9,10,9,15,13,7),
         myTable = c(26,28,34,44,44,18))

check_chisq <- chisq.test(mm1$mySample, mm1$myTable)

Have a look at this stored object and you will see the "observed" data being interrogated:

check_chisq$observed

            mm1$myTable
mm1$mySample 18 26 28 34 44
          7   1  0  0  0  0
          9   0  1  0  1  0
         10   0  0  1  0  0
         13   0  0  0  0  1
         15   0  0  0  0  1

Hence this is being treated as a 5x5 table, which gives (5 - 1) x (5 - 1) = 16 d.f. As you can see, that is not what you want in this example.
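The degrees of freedom and the warning from the question can both be checked from the stored result. This is a small sketch; `res_vectors` and `dims` are illustrative names, and suppressWarnings() is used only to keep the output quiet.

```r
mySample <- c(9, 10, 9, 15, 13, 7)
myTable  <- c(26, 28, 34, 44, 44, 18)

# Two-vector call: the vectors are cross-tabulated into a 5 x 5 table
res_vectors <- suppressWarnings(chisq.test(mySample, myTable))

dims <- dim(res_vectors$observed)
(dims[1] - 1) * (dims[2] - 1)   # matches res_vectors$parameter
max(res_vectors$expected)        # largest expected count in the 5 x 5 table
```

With only six observations spread over 25 cells, every expected count is well below 5, so the "Chi-squared approximation may be incorrect" warning is appropriate for the table that was actually tested, even though the original counts are not small.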

James Stanley
  • Thanks, @james, this is exactly the help I needed. I'm going to use this exchange as an example for the class! – D1785 Sep 29 '21 at 14:39