I have a data set where I want to compare the number of events that occurred in two geographic areas over a period of six years. Altogether, there are 407 districts that comprise the geographic areas.
In area $p$, there are 220 districts, while in area $q$, there are 187 districts.
In area $p$, 130 events happened while in area $q$, 70 events happened.
I want to compare $p$ against $q$, but I am not sure of the right way to compare their means, given that the two areas contain different numbers of districts.
In pseudo-Python code, I envisioned two ways to do it.

    years = 6

    p_areas = 220
    # average number of events per district per year within p, normalized by p's own districts
    p_rate = df.loc[df['geographic_area'] == 'p', 'event'].sum() / (years * p_areas)

    q_areas = 187
    # average number of events per district per year within q, normalized by q's own districts
    q_rate = df.loc[df['geographic_area'] == 'q', 'event'].sum() / (years * q_areas)

or

    total_area = 407
    # group by p or q, sum the events, then normalize by all 407 districts over 6 years
    df.groupby('geographic_area')['event'].sum() / (total_area * years)

The two approaches give very different answers, so I guess it comes down to what the denominator should be when comparing the two areas: the same for both (e.g., all 407 districts), or the number of districts in each area (220 and 187)?
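To make the difference concrete, here is a minimal sketch that plugs the totals above into both denominators. The variable names are my own for illustration, and it assumes every district is observed in all six years.

```python
# Totals taken from the description above.
years = 6
districts = {"p": 220, "q": 187}
events = {"p": 130, "q": 70}
total_districts = sum(districts.values())  # 407

# Approach 1: normalize each area by its own district-years.
rate_own = {a: events[a] / (years * districts[a]) for a in events}

# Approach 2: normalize both areas by all 407 districts over 6 years.
rate_shared = {a: events[a] / (years * total_districts) for a in events}

# Ratio of p to q under each approach.
ratio_own = rate_own["p"] / rate_own["q"]
ratio_shared = rate_shared["p"] / rate_shared["q"]

print({a: round(r, 4) for a, r in rate_own.items()})     # {'p': 0.0985, 'q': 0.0624}
print({a: round(r, 4) for a, r in rate_shared.items()})  # {'p': 0.0532, 'q': 0.0287}
print(round(ratio_own, 3), round(ratio_shared, 3))       # 1.579 1.857
```

With the shared denominator, the ratio of $p$ to $q$ reduces to the ratio of the raw counts (130/70), because the denominator cancels. Only the per-area denominators reflect the fact that $p$ has more districts than $q$.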
The event can occur at most once per district per year, and there are no other predictors.