I am trying to determine if what I am trying to do makes any sense: basically, I have two populations of patients, one being roughly 12 million people, the other one being 30 individuals.

From the bigger one I know that 14% of people belong to a certain group, and in the smaller one 10% have the same characteristics. What I am trying to figure out is whether it makes sense to test if these two fractions are statistically significantly different from each other. In my head that does not make much sense, since the smaller population is, well, small and therefore subject to strong variability.

If it was, however, feasible to compare these two fractions, I would also be very delighted to get some help on how to do so.

Thank you very much!

P.Weyh
  • I feel more info is needed... but are you looking at this type of test https://online.stat.psu.edu/stat800/lesson/5/5.5 (where sample size is taken into consideration)? However, I agree that your numbers seem unbalanced... it looks like you are comparing a true population parameter with a sample statistic – Pitouille Aug 25 '21 at 12:05
  • Hi and thanks for the input already! That is indeed the case: we have a general population of several million and now have a sample of 30 people. The idea was generally to see if the share of "severe disease" in the sample (the 10%) could be considered comparable to the 14% in the general population. What more info would you need from me? – P.Weyh Aug 25 '21 at 12:20
  • If I got it right, it seems that you want to test whether the population proportion estimated from your sample (p hat) equals the true population proportion (p_0). So, assuming that all the requirements are met, the hypothesis test could be H_0: p = p_0 versus H_1: p ≠ p_0, following this approach: https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/sas/sas6-categoricaldata/SAS6-CategoricalData2.html – Pitouille Aug 25 '21 at 12:46
  • You might also be interested in this discussion... https://stats.stackexchange.com/questions/541433/why-cant-t-tests-be-used-for-proportions – Pitouille Aug 25 '21 at 19:14

1 Answer

You have, roughly speaking, the following data.

x1 = 1680000;  n1 = 12000000
x2 = 3;  n2 = 30

You might use `prop.test` in R to compare the two binomial proportions, but this test essentially uses a normal approximation. With only 3 members of your 'certain group' in the second sample, a normal approximation may not be accurate; hence the warning message.

prop.test(c(x1,x2), c(n1,n2), cor=F)

    2-sample test 
    for equality of proportions 
    without continuity correction

data:  c(x1, x2) out of c(n1, n2)
X-squared = 0.39867, df = 1, p-value = 0.5278
alternative hypothesis: two.sided
95 percent confidence interval:
  -0.06735183  0.14735183
sample estimates:
prop 1 prop 2 
  0.14   0.10 

Warning message:
In prop.test(c(x1, x2), c(n1, n2), cor = F):
  Chi-squared approximation may be incorrect
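
To make the normal approximation concrete, here is a minimal sketch that recomputes the statistic by hand from the pooled proportion. The variable names (`p_pool`, `se_pool`, `small_expected`) are mine, and the "expected count below 5" check is only my understanding of the usual rule of thumb behind the warning, not a quote from the R sources.

```r
# Data as given above
x1 <- 1680000; n1 <- 12000000
x2 <- 3;       n2 <- 30

p1 <- x1 / n1
p2 <- x2 / n2

# Pooled proportion under H0: both groups share one success probability
p_pool <- (x1 + x2) / (n1 + n2)

# Pooled standard error of the difference, then the z statistic
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z <- (p1 - p2) / se_pool

# The squared z statistic is referred to a chi-squared distribution with 1 df
chi_sq <- z^2
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

# Expected counts under H0; a count below 5 is the usual red flag
expected <- c(n1 * p_pool, n2 * p_pool,
              n1 * (1 - p_pool), n2 * (1 - p_pool))
small_expected <- any(expected < 5)

print(c(chi_sq = chi_sq, p_value = p_value))
print(small_expected)
```

The squared statistic should agree with the `X-squared` value reported by `prop.test` above, and the expected number of 'certain group' members in the small sample is about 4.2, which is why the approximation is suspect.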

A chi-squared test on the corresponding $2 \times 2$ table is essentially the same test, and it gives the same warning because of the one small count.

TBL = rbind(c(x1,x2), c(n1-x1, n2-x2))
TBL
         [,1] [,2]
[1,]  1680000    3
[2,] 10320000   27
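
For reference, a sketch of how the chi-squared test could be requested directly on this table. It assumes `chisq.test` with `correct = FALSE` (no continuity correction) and wraps the call in `suppressWarnings` only to keep the output tidy; `res` is a name I introduce here.

```r
# Data and table as given above
x1 <- 1680000; n1 <- 12000000
x2 <- 3;       n2 <- 30
TBL <- rbind(c(x1, x2), c(n1 - x1, n2 - x2))

# Pearson chi-squared test without continuity correction;
# the small-count warning is suppressed here on purpose
res <- suppressWarnings(chisq.test(TBL, correct = FALSE))

print(res$statistic)  # Pearson X-squared
print(res$p.value)
print(res$expected)   # expected counts; one of them is below 5
```

Inspecting `res$expected` shows which cell is responsible for the warning.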

However, the implementation of chisq.test in R allows for simulation of a more accurate P-value, which shows no significant difference between the two groups.

chisq.test(TBL, sim=T)

        Pearson's Chi-squared test 
        with simulated p-value 
        (based on 2000 replicates)

data:  TBL
X-squared = 0.14486, df = NA, p-value = 0.7861

By simulation, we have 'cured' the technical difficulty, but simulation does not create new information. The reason we find no significant difference is that we don't have enough observations in the second group to make a valid comparison.
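
One way to see how little 30 observations can tell us is a rough power calculation. This is a sketch under assumptions of my own: the small group is tested against a fixed $p_0 = 0.14$ with an exact binomial test at the 5% level, and its true proportion really is $0.10.$ The names `reject_k` and `power` are mine.

```r
n2    <- 30
p0    <- 0.14   # null value, taken from the large group
p_alt <- 0.10   # assumed true proportion in the small group
alpha <- 0.05

# Two-sided exact binomial p-value for every possible count 0..n2
k <- 0:n2
p_values <- sapply(k, function(kk) binom.test(kk, n2, p0)$p.value)

# Counts that would lead to rejection at level alpha
reject_k <- k[p_values < alpha]

# Power: probability of landing in the rejection region when p = p_alt
power <- sum(dbinom(reject_k, n2, p_alt))

print(reject_k)
print(power)
```

Under these assumptions the power comes out very low, so a non-significant result is almost guaranteed even if the two proportions genuinely differ by this much.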

Another approach would be to regard the proportion $0.14$ from the large group as very nearly the true population proportion. A Jeffreys 95% confidence interval for the true population proportion based on $12\,000\,000$ observations is $(0.1398, 0.1402).$

qbeta(c(.025,.975), x1+.5, n1-x1+.5)
[1] 0.1398038 0.1401964

So we will not be far wrong to compare the proportion $x_2/n_2 = 3/30 = 0.10$ from the second group with $p_1 = 0.14$ from the first group, using an exact binomial test (with no normal approximation). This test shows no significant difference. Its matching 95% confidence interval $(0.021, 0.265)$ also contains the Group 1 proportion $0.14,$ so it is clear that we don't have enough data to say the two groups differ in this respect.

binom.test(3, 30, .14)

        Exact binomial test

data:  3 and 30
number of successes = 3, number of trials = 30, p-value = 0.7914
alternative hypothesis: true probability of success is not equal to 0.14
95 percent confidence interval:
 0.02111714 0.26528845
sample estimates:
probability of success 
                   0.1 
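
The numbers above can be reproduced by hand, which may help show what 'exact' means here. This is a sketch, not the R source: it assumes the two-sided P-value sums the probabilities of all outcomes no more likely than the observed one, and that the interval is the Clopper-Pearson interval obtained from beta quantiles. The small tolerance factor is my own guard against floating-point ties.

```r
x2 <- 3; n2 <- 30; p0 <- 0.14

# Binomial probabilities of every possible count under H0
d <- dbinom(0:n2, n2, p0)

# Two-sided P-value: total probability of outcomes at most as likely
# as the observed count (small relative tolerance for ties)
p_value <- sum(d[d <= d[x2 + 1] * (1 + 1e-7)])

# Clopper-Pearson 95% interval via beta quantiles
ci <- c(qbeta(0.025, x2,     n2 - x2 + 1),
        qbeta(0.975, x2 + 1, n2 - x2))

print(p_value)
print(ci)
```

Both results should match the `binom.test` output shown above.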
BruceET
  • Thank you so much for this incredibly detailed answer and for helping me out in this! That really helps a lot! – P.Weyh Aug 26 '21 at 14:16