Problem: I have a small bag of 166 fruits, of which 62 are apples. I would like to know whether my bag is enriched in apples.
Method: There is a big bag of 20,000 fruits.
- I randomly sample 166 fruits from the big bag and count the number of apples.
- I repeat this operation (randomly sampling 166 fruits and counting the apples) 100 times.
- Hypotheses:
$H_0$ = 'the average number of apples across samplings is equal to or greater than 62', which means the small bag is not enriched in apples
$H_1$ = 'the average number of apples across samplings is less than 62', which means the small bag is enriched in apples
- I performed a one-sample Z-test in Python, as shown below:
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests
df = pd.DataFrame({'Sample': {0: 1,
1: 2,
2: 3,
3: 4,
4: 5,
5: 6,
6: 7,
7: 8,
8: 9,
9: 10,
10: 11,
11: 12,
12: 13,
13: 14,
14: 15,
15: 16,
16: 17,
17: 18,
18: 19,
19: 20,
20: 21,
21: 22,
22: 23,
23: 24,
24: 25,
25: 26,
26: 27,
27: 28,
28: 29,
29: 30,
30: 31,
31: 32,
32: 33,
33: 34,
34: 35,
35: 36,
36: 37,
37: 38,
38: 39,
39: 40,
40: 41,
41: 42,
42: 43,
43: 44,
44: 45,
45: 46,
46: 47,
47: 48,
48: 49,
49: 50,
50: 51,
51: 52,
52: 53,
53: 54,
54: 55,
55: 56,
56: 57,
57: 58,
58: 59,
59: 60,
60: 61,
61: 62,
62: 63,
63: 64,
64: 65,
65: 66,
66: 67,
67: 68,
68: 69,
69: 70,
70: 71,
71: 72,
72: 73,
73: 74,
74: 75,
75: 76,
76: 77,
77: 78,
78: 79,
79: 80,
80: 81,
81: 82,
82: 83,
83: 84,
84: 85,
85: 86,
86: 87,
87: 88,
88: 89,
89: 90,
90: 91,
91: 92,
92: 93,
93: 94,
94: 95,
95: 96,
96: 97,
97: 98,
98: 99,
99: 100},
'nb_apples': {0: 79,
1: 65,
2: 74,
3: 77,
4: 80,
5: 68,
6: 78,
7: 70,
8: 93,
9: 80,
10: 69,
11: 71,
12: 87,
13: 75,
14: 80,
15: 79,
16: 71,
17: 78,
18: 83,
19: 73,
20: 78,
21: 76,
22: 66,
23: 73,
24: 73,
25: 84,
26: 77,
27: 69,
28: 85,
29: 79,
30: 77,
31: 76,
32: 75,
33: 77,
34: 75,
35: 75,
36: 68,
37: 87,
38: 79,
39: 62,
40: 79,
41: 84,
42: 78,
43: 71,
44: 74,
45: 78,
46: 62,
47: 77,
48: 77,
49: 66,
50: 80,
51: 76,
52: 88,
53: 65,
54: 86,
55: 81,
56: 78,
57: 81,
58: 75,
59: 86,
60: 84,
61: 79,
62: 67,
63: 74,
64: 70,
65: 76,
66: 67,
67: 90,
68: 78,
69: 71,
70: 64,
71: 86,
72: 79,
73: 77,
74: 69,
75: 68,
76: 71,
77: 71,
78: 79,
79: 78,
80: 72,
81: 81,
82: 78,
83: 84,
84: 73,
85: 83,
86: 78,
87: 72,
88: 79,
89: 78,
90: 82,
91: 76,
92: 70,
93: 77,
94: 77,
95: 71,
96: 70,
97: 62,
98: 76,
99: 87}})
# Performing the Z-test (statsmodels' ztest uses alternative='two-sided' by default)
nb_appleInsmallBag = 62
ztest, pval = stests.ztest(df['nb_apples'], x2=None, value=nb_appleInsmallBag)
print(f"average number of apples across samplings {df['nb_apples'].mean()}")
print(f'number of apple in my small bag {nb_appleInsmallBag}')
print(f'p-value: {pval}')
if pval < 0.05:
    print('Reject Ho')
else:
    print('Accept Ho')
Here are the results:
average number of apples across samplings 76.07
number of apple in my small bag 62
p-value: 1.3457688075004756e-105
Reject Ho
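For reference, the resampling procedure described under Method can be sketched with the standard library alone. This is only an illustration: the number of apples in the big bag is not given above, so `N_APPLES_BIG = 9165` is a hypothetical value chosen so that the expected count per sampling is close to the observed mean of 76.07, and the helper names `draw_apple_count` and `empirical_fraction` are made up for this sketch. Besides the mean, it reports the fraction of samplings that contain at least 62 apples, which compares the small bag with the individual samplings rather than with their average.

```python
import random

N_FRUITS_BIG = 20_000   # size of the big bag (from the problem statement)
N_APPLES_BIG = 9_165    # ASSUMPTION: not stated above; roughly 20000 * 76.07 / 166
SAMPLE_SIZE = 166       # size of the small bag
N_SAMPLINGS = 100       # number of repeated samplings
OBSERVED = 62           # apples in the small bag


def draw_apple_count(rng, big_bag, sample_size):
    """Draw sample_size fruits without replacement and count the apples."""
    return sum(rng.sample(big_bag, sample_size))


def empirical_fraction(counts, observed):
    """Fraction of samplings with at least `observed` apples."""
    return sum(c >= observed for c in counts) / len(counts)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# True marks an apple, False marks any other fruit
big_bag = [True] * N_APPLES_BIG + [False] * (N_FRUITS_BIG - N_APPLES_BIG)
counts = [draw_apple_count(rng, big_bag, SAMPLE_SIZE) for _ in range(N_SAMPLINGS)]

print(f"mean apples per sampling: {sum(counts) / len(counts):.2f}")
print(f"fraction of samplings with >= {OBSERVED} apples: "
      f"{empirical_fraction(counts, OBSERVED):.2f}")
```

With only 100 samplings, this fraction is a coarse estimate: it cannot resolve probabilities smaller than 0.01.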
I'm rejecting $H_0$. However, the average number of apples across samplings (76.07) is greater than 62, meaning that my small bag is clearly not enriched in apples. I'm confused, and I'm wondering how to formulate the null hypothesis and which statistical test is most appropriate for my problem.
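One way to phrase the question about the small bag itself, rather than about the mean of the samplings, might be an exact one-sided hypergeometric tail probability: the chance of drawing at least 62 apples in 166 fruits if the small bag were a random draw from the big bag. The sketch below is an illustration under stated assumptions, not a definitive recommendation. The apple count of the big bag (`n_apples_big = 9165`) is a hypothetical value back-calculated from the mean of 76.07, and `hypergeom_upper_tail` is a helper name invented here.

```python
import math


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population of size `population` containing `successes` marked items."""
    total = math.comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        math.comb(successes, i) * math.comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    # Python's int / int division stays accurate even for very large integers
    return favourable / total


n_fruits_big = 20_000   # from the problem statement
n_apples_big = 9_165    # ASSUMPTION: not stated in the question
small_bag_size = 166
small_bag_apples = 62

p_enriched = hypergeom_upper_tail(
    small_bag_apples, n_fruits_big, n_apples_big, small_bag_size
)
print(f"P(at least {small_bag_apples} apples by chance) = {p_enriched:.4f}")
```

A small probability would point towards enrichment. With the assumed numbers the probability is close to 1, which agrees with the observation that 62 lies below the typical sampled count.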
Thank you in advance for shedding light on it for me.