How would I compare these means correctly?

Question

I have a data set where I want to compare the number of events that occurred in two geographic areas over a period of six years. Altogether, there are 407 districts that comprise the geographic areas.

In area $p$, there are 220 districts, while in area $q$, there are 187 districts.

In area $p$, 130 events happened while in area $q$, 70 events happened.

The goal is to compare $p$ against $q$, but I am not sure of the right way to compare their means, given that the two areas contain different numbers of districts.

In pseudo-Python code, I envisioned two ways to do it.

years = 6
p_areas = 220
# average number of events per district-year within area p, normalized by p's own districts
df.loc[df['geographic_area'] == 'p', 'event'].sum() / (years * p_areas)

q_areas = 187
# average number of events per district-year within area q, normalized by q's own districts
df.loc[df['geographic_area'] == 'q', 'event'].sum() / (years * q_areas)

or

total_area = 407
# group by area (p or q), sum the events, then normalize by all 407 districts over 6 years
df.groupby('geographic_area')['event'].sum() / (total_area * years)

The two approaches give very different answers, so I guess it comes down to what the denominator should be when comparing the two areas: the same for both (i.e., all 407 districts), or the number of districts unique to each (220 and 187)?

The event can occur at most once per district per year, and there are no other predictors.

You need to give some more details and context. Can the event occur only once in each district? Or at most once per year? ... Do you have other predictors? ... — kjetil b halvorsen, Jan 30 '21 at 22:23
Thanks for your interest and time. The event can only occur once in each district, once per year. No other predictors. — John Stud, Jan 31 '21 at 02:24
Can you please add this new information to the Q as an edit? Not everybody reads comments! — kjetil b halvorsen, Jan 31 '21 at 05:51
These events are independent? E.g. there are no events that tend to cluster in neighboring districts? — Sextus Empiricus, Feb 02 '21 at 20:26

score 1 · Accepted Answer · answered Feb 02 '21 at 20:03

Since there are no other predictors, it comes down to a "coin toss experiment".

You toss coin $P$ a total of $220\cdot 6=1320$ times and see "heads" $130$ times. You toss coin $Q$ a total of $187\cdot 6=1122$ times and see "heads" $70$ times. So the relative frequency of observing heads is $\hat{p}=130/1320\approx 9.8\%$ for coin $P$ and $\hat{q}=70/1122\approx 6.2\%$ for coin $Q$.

This assumes that the occurrence of the events ("heads") is independent between districts and years ("coin tosses") and between areas ("coins")!
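As a minimal sketch of this arithmetic in plain Python (the variable and function names are illustrative and not taken from the question's data frame), each area's event count is divided by its own number of district-years. Dividing both counts by all 407 districts instead would only rescale the raw counts 130 and 70 by a common constant, so it would not account for the different sizes of the two areas.

```python
# Counts taken from the question; all names here are illustrative assumptions.
YEARS = 6
DISTRICTS = {"p": 220, "q": 187}
EVENTS = {"p": 130, "q": 70}


def event_rate(area: str) -> float:
    """Events per district-year, using the area's own number of districts."""
    trials = DISTRICTS[area] * YEARS  # district-years ("coin tosses")
    return EVENTS[area] / trials


def common_denominator_share(area: str) -> float:
    """Events divided by all district-years; this only rescales the raw count."""
    total_trials = sum(DISTRICTS.values()) * YEARS
    return EVENTS[area] / total_trials


rates = {area: event_rate(area) for area in DISTRICTS}
shares = {area: common_denominator_share(area) for area in DISTRICTS}

for area in DISTRICTS:
    print(f"{area}: rate = {rates[area]:.3f}, "
          f"common-denominator share = {shares[area]:.3f}")
```

With the common denominator, the ratio between the two areas is simply 130/70, the ratio of the raw counts, whereas the per-area rates compare events per opportunity.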

Is there a test for the hypothesis that coins $P$ and $Q$ are different? Yes, there are actually two; you can find them in the answers to the question here :-) Using Fisher's exact test in R with significance level $\alpha=0.05$, the answer is:

> fisher.test(rbind(c(130, 1320-130), c(70, 1122-70)), alternative = "two.sided")

    Fisher's Exact Test for Count Data

data:  rbind(c(130, 1320 - 130), c(70, 1122 - 70))
p-value = 0.001402
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.20303 2.25472
sample estimates:
odds ratio 
  1.641444

The p-value is smaller than $\alpha$, so you reject the null hypothesis that the coins (ahem, areas) have the same probability of seeing heads (ahem, the event).
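For readers working in Python, as in the question, a rough cross-check is the large-sample two-proportion z-test. Note that this is a different test from Fisher's exact test, so its p-value will not match the R output exactly. It is only a sketch under the same independence assumption, written with the standard library; the function name is illustrative.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal: P(|Z| >= |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# 130 events in 220 * 6 district-years versus 70 events in 187 * 6 district-years
z, p_value = two_proportion_z_test(130, 220 * 6, 70, 187 * 6)
print(f"z = {z:.2f}, two-sided p = {p_value:.4f}")
```

The normal approximation should be reasonable here because both areas have well over a thousand district-years and dozens of events; with small counts, an exact test would be the safer choice.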

Edgar


This is helpful, but I would not consider the events independent. How might we compare them in this case? – John Stud Feb 02 '21 at 21:47
Are they independent between areas or districts or years? – Edgar Feb 02 '21 at 21:58
Perhaps by years only. – John Stud Feb 02 '21 at 22:11
If you assume dependence between districts, it will be very hard to design a test or model that assesses the difference between areas in a statistically reasonable way. In this case it would be best to report 9.8% and 6.2% as descriptive statistics and not talk about significance or hypothesis tests at all. – Edgar Feb 02 '21 at 22:16
This is because there could be some dependence structure between districts where every event makes the next event more likely, such that in the end 130 and 70 events would be equally likely to be observed in areas of this size. Without prior knowledge about the dependence structure, one cannot make any reasonable claim. – Edgar Feb 02 '21 at 22:18
What about comparing their means, without wondering whether they are statistically different? – John Stud Feb 02 '21 at 22:22
Yes, 9.8% vs 6.2%. That's the most reasonable thing to report, I guess. – Edgar Feb 02 '21 at 22:23
