Statistical test for comparing two frequencies with R

Question

I have this situation: a set of 5000 objects, 4950 blue and 50 green. From this set, two people (person A and person B) each separately fish 100 objects, and I would like to know whether either of them has tricked me (i.e. did not fish blind).

Person A fished one set of 100 objects, 90 blue and 10 green.
Person B fished one set of 100 objects, 99 blue and 1 green (which is what would be expected by chance).

Which test should I use? Could someone point me to an R example or solution? Thanks in advance.

Questions about choosing statistical tests are off topic for Stack Overflow. If you need statistical advice, you should ask your question at [stats.se] instead. Once you know the right test to use, you can probably easily google how to do that in R, but if you are still stuck, then you can ask a question like that here. — MrFlick, Sep 27 '18 at 16:59
I agree that CV is the place to ask, but you should fix your typo about person B. — meh, Sep 27 '18 at 17:16

score 1 · Answer 1 · answered Sep 27 '18 at 17:37

The first idea that comes to mind is to compute the likelihood of fishing each set of 100 objects, and to use a threshold on this likelihood to detect cheaters.

If you know what cheaters would be inclined to do (for example, green objects are more desirable and they would boost their number of green objects fished), then you can look for that directly.

In the example you gave, you have a 0.01 probability of fishing a green object. Each draw is a Bernoulli trial, and the number of green objects in 100 trials follows a binomial distribution $X \sim B(100, 0.01)$. You can use the cumulative distribution function of the binomial distribution to determine at how many green objects you should start worrying (essentially, find the smallest $x$ for which $P(X \geq x) < p$, where $p$ is how unlikely a result must be for it to be worrisome).
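The threshold search described above could be sketched in R as follows. The cutoff `alpha = 0.001` and all variable names are illustrative assumptions, not values from the question; pick whatever level of "worrisome" suits you.

```r
# Binomial model for the number of green objects among 100 draws.
n_draws <- 100      # objects fished per person
p_green <- 0.01     # 50 green out of 5000
alpha   <- 0.001    # assumed cutoff for "too unlikely"; choose your own

# P(X >= x) for every possible count x = 0, 1, ..., n_draws.
# pbinom(x - 1, ..., lower.tail = FALSE) is P(X > x - 1) = P(X >= x).
x         <- 0:n_draws
tail_prob <- pbinom(x - 1, size = n_draws, prob = p_green, lower.tail = FALSE)

# Smallest count whose upper-tail probability falls below the cutoff.
threshold <- min(x[tail_prob < alpha])
threshold
```

Any person whose green count reaches `threshold` would then be flagged. Because the real draws are made without replacement, this binomial calculation is only an approximation.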

R Greg Stacey · Answer 2 · 2018-09-28T18:48:46.350

1

Edit: Fisher's exact is the wrong test, but a hypergeometric test is appropriate.

Following the answer to a similar question, you can test how "unlikely" either proportion is using ~~Fisher's exact test or~~ a hypergeometric test. From your question, you're interested in whether the proportion of blue:green for either person (90:10 person A, 99:1 person B) significantly differs from the true proportion (4950:50). In that case you have two contingency tables:

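$$
\begin{array}{l|cc|c}
 & \text{Blue} & \text{Green} & \text{Total} \\ \hline
\text{Person A} & 90 & 10 & 100 \\
\text{Truth} & 4950 & 50 & 5000 \\ \hline
\text{Total} & 5040 & 60 & 5100
\end{array}
$$

$$
\begin{array}{l|cc|c}
 & \text{Blue} & \text{Green} & \text{Total} \\ \hline
\text{Person B} & 99 & 1 & 100 \\
\text{Truth} & 4950 & 50 & 5000 \\ \hline
\text{Total} & 5049 & 51 & 5100
\end{array}
$$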

and you'd want to test both tables. Since the hypergeometric distribution models the number of successes (here, green objects) in a fixed number of draws without replacement, i.e. your situation, you can use `phyper` in R to run a hypergeometric test:

    pA <- phyper(10 - 1, 50, 4950, 100, lower.tail = FALSE)  # P(green >= 10), person A
    pB <- phyper(1 - 1, 50, 4950, 100, lower.tail = FALSE)   # P(green >= 1), person B

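This gives pA = 3.4e-08 and pB = 0.64. (The -1 in `phyper(10-1, ...)` is there because, with the lower tail turned off, `phyper` returns the probability of strictly more than its first argument; subtracting 1 gives the probability of drawing that number of green objects or more.)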

So, by the logic of the hypergeometric test, Person A's basket of fish is highly unlikely to have occurred by chance, while person B's basket is totally reasonable.
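As a sanity check on the -1 offset, the same tail probability can be computed by summing the probability mass function directly. This is a minimal sketch; the variable names are illustrative.

```r
# Population: 50 green and 4950 blue; each person draws 100 without replacement.
green_total <- 50
blue_total  <- 4950
draws       <- 100

# P(X >= 10) via the upper tail of the CDF (note the "- 1").
p_tail <- phyper(10 - 1, green_total, blue_total, draws, lower.tail = FALSE)

# The same probability as an explicit sum of the PMF over 10, 11, ..., 50.
p_sum <- sum(dhyper(10:min(green_total, draws), green_total, blue_total, draws))

# Without the "- 1", phyper returns P(X > 10) = P(X >= 11), which is smaller.
p_strict <- phyper(10, green_total, blue_total, draws, lower.tail = FALSE)

c(p_tail = p_tail, p_sum = p_sum, p_strict = p_strict)
```

The first two values should agree up to floating-point error, while the third omits the probability of exactly 10 green objects.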

edited Sep 28 '18 at 18:48

answered Sep 27 '18 at 18:26

R Greg Stacey


This is a good solution for the question you state--but I don't think it's the question that was asked. We already know the contents of the set, so the issue isn't one of comparing the two samples, but only of comparing the 90-10 sample to the known contents of the set. – whuber Sep 27 '18 at 18:36
@whuber Isn't the question asking how to test whether 90:10 is likely to have been drawn from 4950:50, and similarly for 99:1? Doesn't Fisher's exact test that? – R Greg Stacey Sep 27 '18 at 18:40
I don't see how Fisher's test applies to a *single* sample. As @Vincent pointed out in his answer, the sample count of $10$ has approximately a Binomial distribution (it's really hypergeometric, but close enough) and that suffices for a full analysis of the result. Where it gets interesting is in a different formulation, which (generalizing a bit) supposes there are $m$ independent samples from this set and asks which (if any) are inconsistent with the hypothesis that it is a simple random sample. I don't see a direct application of Fisher's test to that interpretation, either. – whuber Sep 27 '18 at 18:45
@whuber I think I'm doing a two sample, but the set itself is the second sample. Is my error in using Fisher's exact to compare person A's sample (90:10) to the entire set (4950:50)? – R Greg Stacey Sep 27 '18 at 19:10
@whuber Sorry to belabor the point, but I'm curious because using Fisher's exact here seems similar to gene set enrichment analyses, where a proportion in a subset of genes is compared to a proportion in the entire set of genes. e.g. https://www.biostars.org/p/110781/ I'd love to clear up my confusion. Happy to discuss this outside of comments. – R Greg Stacey Sep 27 '18 at 19:34

For this question, the entire set is the population--it's not a sample and is not subject to any uncertainty. It is of some interest to compare the Fisher test result to the hypergeometric probability of observing 10 or more green balls in the sample of 100. `R`'s function `phyper(9, 50, 4950, 100, lower.tail=FALSE)` reports that as 3.4e-8, which is just one-fifth the Fisher Test p-value, showing that they do give different results. In this situation the answers will never be too far apart because the population is large; with smaller populations the answers can differ more. – whuber Sep 27 '18 at 19:53

@whuber Thank you for explaining. My (new) understanding is that Fisher's exact is not appropriate for comparing one sample to the population, but a hypergeometric test is. I'll edit my answer to use the hypergeometric test. – R Greg Stacey Sep 27 '18 at 19:59
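The comparison whuber describes in the comments can be reproduced with a sketch along these lines. It assumes the 2x2 table from the answer (person A against the full set) and the default two-sided `fisher.test`; the variable names are illustrative.

```r
# 2x2 table that treats the whole set as if it were a second sample.
tab <- matrix(c(90,   10,
                4950, 50),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Person A", "Truth"), c("Blue", "Green")))

# Fisher's exact test conditions on the table margins (60 green out of 5100).
fisher_p <- fisher.test(tab)$p.value

# Hypergeometric tail against the known population (50 green out of 5000).
hyper_p <- phyper(10 - 1, 50, 4950, 100, lower.tail = FALSE)

# How many times larger the Fisher p-value is than the hypergeometric one.
fisher_p / hyper_p
```

The two p-values differ because Fisher's test counts person A's 100 objects a second time inside the "Truth" row, whereas the hypergeometric calculation treats the set as a fixed, fully known population.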

scs · Answer 3 · 2018-10-02T13:23:27.400

1

This answer was corrected according to the comment of whuber.

You need to use the hypergeometric distribution. In contrast to the binomial distribution, the hypergeometric distribution describes drawing without replacement. https://en.wikipedia.org/wiki/Hypergeometric_distribution

In R, the hypergeometric test can be run with `phyper`, the cumulative distribution function of the hypergeometric distribution.

    phyper(q, m, n, k)

q is the quantile, here based on the number of green objects drawn,

m is the total number of green objects in the set,

n is the total number of blue objects in the set,

k is the total number of objects drawn.

With your numbers, this results in

    pA <- phyper(10 - 1, 50, 4950, 100, lower.tail = FALSE)  # person A: 10 or more green
    pB <- phyper(1 - 1, 50, 4950, 100, lower.tail = FALSE)   # person B: 1 or more green

The -1 in phyper(10-1,...) is there because we want the probability of getting greater than or equal to that number of green draws.
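To keep the argument order and the off-by-one adjustment in one place, the call can be wrapped in a small helper. `green_tail_p` and its parameter names are hypothetical, introduced only for this sketch; they are not part of base R.

```r
# Hypothetical helper: probability of drawing `green_drawn` or more green
# objects when `n_drawn` objects are taken without replacement.
green_tail_p <- function(green_drawn, green_total, blue_total, n_drawn) {
  # With lower.tail = FALSE, phyper returns P(X > q), so pass green_drawn - 1
  # to obtain P(X >= green_drawn).
  phyper(green_drawn - 1, m = green_total, n = blue_total, k = n_drawn,
         lower.tail = FALSE)
}

pA <- green_tail_p(10, green_total = 50, blue_total = 4950, n_drawn = 100)
pB <- green_tail_p(1,  green_total = 50, blue_total = 4950, n_drawn = 100)
c(pA = pA, pB = pB)
```

Naming the `m`, `n` and `k` arguments explicitly makes it harder to swap the green and blue totals by accident.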

edited Oct 02 '18 at 13:23

answered Sep 27 '18 at 20:03

scs


The hypergeometric distribution is the right idea, but its probability mass function is not a correct way to solve this problem. See the other answers in this thread for correct approaches. – whuber Sep 30 '18 at 21:23
Thank you. I corrected the answer including the off by one error and named your contribution. – scs Oct 02 '18 at 13:24
