
Consider the following R code and output:

row1 = c(0,23,0,0)
row2 = c(0,1797,0,0)
data.table = rbind(row1, row2)
chisq.test(data.table)

    Pearson's Chi-squared test

data:  data.table
X-squared = NaN, df = 3, p-value = NA

Now consider the same in Python:

import scipy.stats
scipy.stats.chi2_contingency([[0,23,0,0], [0,1797,0,0]])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/scipy/stats/contingency.py", line 236, in
     chi2_contingency
    "frequencies has a zero element at %s." % zeropos)
ValueError: The internally computed table of expected frequencies has a zero element at [0, 0, 0, 1, 1, 1].

Is this expected behaviour? Should I just trap the error in Python? A search for the message "The internally computed table of expected frequencies has a zero element at" did not reveal anything useful.

SabreWolfy
    The real issue comes not because some observed cells are 0 but because some columns are all-zero. This makes the expected values in that column zero, which makes the contribution to chisquare of each cell in that column, $(O_i-E_i)^2/E_i = 0/0$. You can't compute a chi-square in that situation; in R, $0/0$ is `NaN`, and a sum that includes a `NaN` is a `NaN` - it needn't trap it because it returns the 'right' answer by default. Before calling the scipy function, check your row and column totals are all > 0. – Glen_b Oct 25 '13 at 00:37
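
Following that suggestion, a minimal pre-check in Python might look like the sketch below (check_margins is a name invented here for illustration, not a scipy function):

import numpy as np
from scipy.stats import chi2_contingency

def check_margins(table):
    # True only when every row total and every column total is positive.
    table = np.asarray(table)
    return bool(table.sum(axis=1).all() and table.sum(axis=0).all())

table = [[0, 23, 0, 0], [0, 1797, 0, 0]]
if check_margins(table):
    chi2, p, dof, expected = chi2_contingency(table)
else:
    print("table has an all-zero row or column; the chi-squared test is undefined")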

2 Answers


The problem is not that there are 0 cells; the problem is that some columns contain no data at all. For example,

row1 = c(100,23,0,100)
row2 = c(0,1797,100,0)
data.table = rbind(row1, row2)
chisq.test(data.table)

works fine

and

row1 = c(10,0,0,100)
row2 = c(0,1797,100,0)
data.table = rbind(row1, row2)
chisq.test(data.table)

gives only a warning that the chi-squared approximation may be incorrect; here an exact test should be used instead.
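
scipy has no exact test for tables larger than 2 x 2 (scipy.stats.fisher_exact handles only 2 x 2), but the simulated p-value R offers via chisq.test(..., simulate.p.value = TRUE) can be approximated with a permutation sketch along these lines; simulated_chi2_pvalue is a name invented here, and a reasonably recent NumPy is assumed:

import numpy as np

def simulated_chi2_pvalue(table, n_sim=10000, seed=0):
    # Monte Carlo permutation test of independence, holding both margins fixed.
    table = np.asarray(table)
    rng = np.random.default_rng(seed)
    # Expand the counts back into one (row label, column label) pair per observation.
    r, c = np.indices(table.shape)
    rows = np.repeat(r.ravel(), table.ravel())
    cols = np.repeat(c.ravel(), table.ravel())

    def pearson_stat(rw, cl):
        obs = np.zeros(table.shape)
        np.add.at(obs, (rw, cl), 1)
        exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
        return ((obs - exp) ** 2 / exp).sum()

    observed = pearson_stat(rows, cols)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(cols)  # permuting one set of labels preserves both margins
        if pearson_stat(rows, cols) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

print(simulated_chi2_pvalue([[10, 0, 0, 100], [0, 1797, 100, 0]]))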

Even

row1 = c(23,0,0,0)
row2 = c(0,1797,0,0)
data.table = rbind(row1, row2)
chisq.test(data.table)

gives that same warning, but the statistic is again NaN, because the last two columns still have zero totals and so produce the same 0/0 in the expected counts.
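
For comparison, here is a sketch of the same three tables in Python; scipy computes the first two and raises the ValueError for the third, whose last two column totals are zero:

import scipy.stats

tables = [
    [[100, 23, 0, 100], [0, 1797, 100, 0]],  # every column total positive
    [[10, 0, 0, 100], [0, 1797, 100, 0]],    # positive totals, some small expected counts
    [[23, 0, 0, 0], [0, 1797, 0, 0]],        # last two column totals are zero
]
for t in tables:
    try:
        chi2, p, dof, expected = scipy.stats.chi2_contingency(t)
        print("X-squared = %.2f, p-value = %.4g" % (chi2, p))
    except ValueError as err:
        print("ValueError:", err)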

Peter Flom

Both calls hit an error condition; R just reports NaN instead of raising one.

The reason they are errors likely has to do with division by zero. You must have some kind of count in each cell; typically an expected count of at least 4-7 per cell is preferred (see any article on the assumptions and requirements of a chi-squared test). The test assesses independence, but it cannot do so when, in a 2 by k design, a column has no data in either cell.

If the problem is just that Python will exit, then by all means trap the error.
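
A minimal sketch of trapping it, falling back to NaN the way R does:

import scipy.stats

table = [[0, 23, 0, 0], [0, 1797, 0, 0]]
try:
    chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
except ValueError:
    # An all-zero row or column makes some expected frequencies zero,
    # so the statistic is undefined; mimic R's NaN rather than stopping.
    chi2, p = float("nan"), float("nan")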

John
  • Is there a better or alternative test to use in this case, as detailed here: http://stats.stackexchange.com/q/73575/2824 – SabreWolfy Oct 24 '13 at 20:28
  • Get more data. You've got nothing to test in 3 out of 4 pairs. How could anything tell if they're independent? – John Oct 24 '13 at 20:58
  • There is no data for those points. I must be using the wrong test. – SabreWolfy Oct 24 '13 at 21:00
  • This is not correct. See my answer. – Peter Flom Oct 24 '13 at 22:47
  • Peter's right that it's incorrect, because there just is no test. It's not that there are 0 cells, which, while bad, is not in itself catastrophic. The problem is, as I said above, that you're missing pairs of them. That leads to division by zero, as Glen_b's comment indicates, but it also leaves no data to test for independence. So you could be missing some, but here you're missing too much. If you had only one all-zero column and the chi-square couldn't be calculated, I might even recommend a bootstrap. But there's just not enough here with only one column containing counts. – John Oct 25 '13 at 01:14
  • Frank Harrell on this site has a pretty epic post on evaluating [contingency tables with small counts](http://stats.stackexchange.com/a/14230/1036). If you use the `N-1` correction you only need to have expected frequencies around 1 for nominal coverage rates. – Andy W Oct 25 '13 at 13:00
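
For what it's worth, the N-1 correction mentioned above just rescales the Pearson statistic by (N-1)/N; a minimal sketch, reusing the second table from Peter Flom's answer (all of its column totals are positive):

import scipy.stats

table = [[10, 0, 0, 100], [0, 1797, 100, 0]]
chi2, p, dof, expected = scipy.stats.chi2_contingency(table)
n = sum(map(sum, table))
chi2_n1 = chi2 * (n - 1) / n              # the 'N-1' corrected statistic
p_n1 = scipy.stats.chi2.sf(chi2_n1, dof)  # p-value from the same chi-squared df
print(chi2_n1, p_n1)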