
As the title suggests, I have a very basic question.

I have a case with the following data:

Universe: 18840 balls total
red balls in the universe: 6680
Sample: 382 balls total
red balls in the sample: 160

I would like to test whether the percentage of red balls in my sample is significantly different from the percentage of red balls in the universe.

In your opinion, is it more correct to use a chi-square test or a hypergeometric distribution?

kjetil b halvorsen
GuidoL

1 Answer


Take care to note you're discussing two different statistics here.

Let's set up the sampling situation in detail first so we can be clear:

We have red balls and not-red balls (for simplicity I will call them all 'black', but they could be a mix of non-red colors; it's irrelevant to this setup, since they are all simply categorized as not-red).

You have a population (your 'universe') of 18840 balls, 6680 red and 12160 black. You draw a random sample of 382 balls without replacement, and obtain 160 red and 222 black.

That is, your example data are like so:

         Drawn    Not drawn    Total

Red      160        6520        6680
Black    222       11938       12160

Total    382       18458       18840

The number of reds drawn, treated as a random variable, has a hypergeometric distribution (though that distribution is usually formulated in terms of white and black balls drawn from an urn rather than red and black balls drawn from a universe).

[Conditioning on the margins gives the hypergeometric - this is also the situation used for Fisher's exact test based on the hypergeometric, and one of the situations for which the usual 2x2 chi-square test of association/test of independence applies. If you don't condition on both margins, you don't have a hypergeometric; but that's what you normally do in the specific balls-in-urns model you describe.]

If $O_{ij}$ is the observed count in cell $(i,j)$ in the above $2\times 2$ table, and $E_{ij}$ is the corresponding expected count with the margins held fixed (row total times column total, divided by the grand total), then your statistics are $O_{11}$ in the first case (assuming red is first) and $X^2 = \sum \sum {(O_{ij} - E_{ij})^2 \over E_{ij}}$ in the second. Both statistics are actually discrete, but you can approximate either by a continuous distribution: the first by a normal approximation, the second by a chi-square.
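As a rough illustration of the two statistics, here is a minimal Python sketch that computes the expected counts, $O_{11}$, and $X^2$ from the table above. The variable names are my own, and the standardised value `z` assumes the usual hypergeometric mean and variance; treat it as a sketch rather than a prescribed procedure.

```python
from math import sqrt

# Counts from the 2x2 table above: rows = (red, black), columns = (drawn, not drawn).
observed = [[160, 6520],
            [222, 11938]]

row_totals = [sum(row) for row in observed]        # [6680, 12160]
col_totals = [sum(col) for col in zip(*observed)]  # [382, 18458]
grand_total = sum(row_totals)                      # 18840

# Expected counts with both margins held fixed: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# First statistic: the number of red balls drawn.
o11 = observed[0][0]

# Second statistic: Pearson's X^2, summed over all four cells.
x2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))

# Standardised O11, using the hypergeometric mean and variance
# (this is the quantity a normal approximation would be applied to).
K, n, N = row_totals[0], col_totals[0], grand_total
variance = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
z = (o11 - expected[0][0]) / sqrt(variance)

print(f"E11 = {expected[0][0]:.2f}, X^2 = {x2:.3f}, z = {z:.3f}")
```

For a $2\times 2$ table with fixed margins the two statistics carry essentially the same information: algebraically $X^2 = z^2 \, N/(N-1)$, so with $N = 18840$ the square of the standardised $O_{11}$ and $X^2$ are nearly identical.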

With random sampling, the distribution of the number of red balls in the sample ($O_{11}$) is hypergeometric - that is, given the usual assumptions it's exactly correct.
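To make the exact calculation concrete, the following sketch evaluates the hypergeometric probabilities directly with Python's standard library (`math.comb`, Python 3.8+). The function name `hypergeom_pmf` and the choice of an upper-tail probability are assumptions for illustration; which tail (or two-sided rule) is appropriate depends on the hypothesis being tested.

```python
from math import comb

N, K, n = 18840, 6680, 382   # universe size, reds in universe, sample size
observed_reds = 160

# Number of equally likely samples of size n from the universe.
denominator = comb(N, n)

def hypergeom_pmf(k: int) -> float:
    """P(exactly k reds in the sample) under sampling without replacement."""
    return comb(K, k) * comb(N - K, n - k) / denominator

# Upper tail: probability of drawing at least as many reds as were observed.
upper_tail = sum(hypergeom_pmf(k) for k in range(observed_reds, n + 1))

print(f"P(O11 >= {observed_reds}) = {upper_tail:.5f}")
```

Because the binomial coefficients are computed as exact integers and only divided at the end, no approximation enters beyond ordinary floating-point rounding.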

Given the universe details and the sample size, the usual 'chi-square' statistic, though discrete, will be quite well approximated by a chi-square distribution when the number of red balls in the sample is hypergeometric. It's not exact, but it will be quite close in this case.
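One way to check how close the approximation is here: enumerate every possible red count, compute the $X^2$ value it would produce, and add up the hypergeometric probabilities of the outcomes whose $X^2$ is at least as large as the observed one. The sketch below does this and compares the result with the chi-square (1 df) tail, which can be written as $\operatorname{erfc}\!\left(\sqrt{x/2}\right)$. The helper names are hypothetical, and this is only an illustrative check under the sampling assumptions stated above.

```python
from math import comb, erfc, sqrt

N, K, n = 18840, 6680, 382   # universe size, reds in universe, sample size
observed_reds = 160
denominator = comb(N, n)

def pmf(k: int) -> float:
    """Hypergeometric probability of exactly k reds in the sample."""
    return comb(K, k) * comb(N - K, n - k) / denominator

def x2_stat(k: int) -> float:
    """Pearson X^2 for the 2x2 table implied by k reds in the sample."""
    # Each entry: (observed count, row total, column total).
    cells = [
        (k,             K,     n),      # red, drawn
        (K - k,         K,     N - n),  # red, not drawn
        (n - k,         N - K, n),      # black, drawn
        (N - K - n + k, N - K, N - n),  # black, not drawn
    ]
    total = 0.0
    for obs, row, col in cells:
        exp = row * col / N
        total += (obs - exp) ** 2 / exp
    return total

x2_obs = x2_stat(observed_reds)

# Exact tail probability of the discrete X^2 statistic under the hypergeometric.
exact_p = sum(pmf(k) for k in range(n + 1) if x2_stat(k) >= x2_obs)

# Chi-square approximation with 1 degree of freedom: P(chi2_1 > x) = erfc(sqrt(x / 2)).
approx_p = erfc(sqrt(x2_obs / 2))

print(f"exact p = {exact_p:.5f}, chi-square approximation = {approx_p:.5f}")
```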

Glen_b
  • I don't understand the recommendation for a chi-square statistic, given that the sample distribution is *known*: why not just compute the p-value directly from the hypergeometric distribution? – whuber Feb 28 '13 at 17:17
  • I don't know that I "recommended" a chi-square at all. I responded to the way the question was framed by discussing the relative quality (as approximations) of the two distributions - given the assumptions, the distribution of the r.v. *is* hypergeometric and given the same assumptions and the sample size, the distribution of the chi-square statistic *is* well approximated by a chi-square distribution. What recommendation is contained in either (supportably factual) statement? – Glen_b Feb 28 '13 at 21:30
  • Although you are correct, as the author of this reply you might not notice that it would be natural for most readers to interpret the phrase "you can construct" as an explicit recommendation of that approach. It seems a little disingenuous to leave it at that unless you really believe this is a good procedure--and so far you still haven't indicated what your stance is. Am I missing something? Is there anything about using the chi-squared distribution in this circumstance that is preferable to the simple, direct, completely accurate solution using the hypergeometric? – whuber Feb 28 '13 at 21:35
  • Of course there are situations where the chi-square may be *preferable* - but that wasn't in the question, which only addressed correctness. My answer directly addresses that question, by quite explicitly stating that the hypergeometric is "correct" in the intended sense. It is a *complete and correct* answer to the question that then adds further information about accuracy of the other option. Is there something you are suggesting I should add? I'm not seeking to be obtuse, I just really fail to see what the problem is here. – Glen_b Feb 28 '13 at 21:40
  • I just found the second paragraph confusing, that's all, because (1) putting "correct" in quotation marks in the first paragraph suggests you don't really believe what you wrote, so then (2) following that up with a discussion of the chi-squared test makes it look like you really are recommending it as "most correct." As such, your answer prompted my initial comment looking to clear up that confusion. – whuber Feb 28 '13 at 21:43
  • On scare-quoting "correct" - that I certainly can see could mislead. I had a particular reason for doing so (based on the mismatch between the two *statistics* under discussion - $O_{11}$ in the first case and $X^2 = \sum \sum {(O_{ij} - E_{ij})^2 \over E_{ij}}$ in the second). However, I didn't clarify that and that is clearly an issue I can deal with. So thanks, I can make my answer better. – Glen_b Feb 28 '13 at 21:52
  • @whuber Hopefully the explanation in the new preamble and the clearer emphasis on the correctness of the hypergeometric help. If you think it remains confusing, please let me know. – Glen_b Feb 28 '13 at 22:05
  • Thank you -- that makes it so much clearer. It was especially helpful to see the formulas because they make it apparent what was really meant by a chi-squared test. Just one more thing, if you will bear with me: why do you double-index the cells? Aren't there only two cells in this application, containing the red ball count (160) and non-red ball count (222) in the sample? Once again I worry that I am missing something or not understanding correctly, or perhaps even overlooking an intended generalization that I did not perceive. – whuber Feb 28 '13 at 22:40
  • @whuber - detailed clarification of the situation given. See in particular [the example at this link](http://en.wikipedia.org/wiki/Hypergeometric_distribution#Application_and_example). Hopefully it's clear enough now. – Glen_b Feb 28 '13 at 23:29
  • Once again, very helpful. What had confused me is that there is more than one chi-squared test that can be applied here. The second and third columns of the two-way table are actually superfluous: we may construct a chi-squared statistic from the data $(160, 222)$ and their expected counts $(135.4, 246.6)$. It too has a discrete distribution approximated by a chi-squared distribution with one DF. – whuber Mar 01 '13 at 00:11