Beginner question on statistical comparison between unbalanced groups and a variable dependent on group size

Question

I am trying to compare the success rate of males vs. females in my competition dataset. I have the outcomes of several competitions, including the rank of each participant. Across all competitions, participants are 70% male and 30% female.

Let's say I see that males rank higher on average. That doesn't seem like a fair comparison, since there are more males in my data.

What's the correct way to compare in such a case?

score 1 · Answer 1 · answered Feb 16 '21 at 17:36

Unequal sample sizes are not a problem. The usual hypothesis tests (probably a t-test here) account for the size of each group in their calculations. In experimental design, equal sample sizes give maximum power for a fixed total, but there are other considerations, such as cost. Suppose you have $1000$ volunteers and balancing the groups would be prohibitively costly. Even though the ideal is a $500/500$ split, a $700/300$ split (the ratio you have) may still power your study adequately while staying within your budget.
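To make the point about power concrete, here is a minimal sketch comparing the standard error of the difference in means for the two splits. The function name `se_of_difference` and the common standard deviation of 1 are assumptions for illustration, not something from the question's data. The formula is the denominator of the unequal-variance (Welch) t-statistic, where each group's variance is divided by its own sample size.

```python
import math

def se_of_difference(sd1, n1, sd2, n2):
    """Standard error of (mean1 - mean2); each group's variance is divided by its own size."""
    return math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Assumption for illustration: both groups have the same standard deviation of 1.
se_balanced = se_of_difference(1.0, 500, 1.0, 500)
se_unbalanced = se_of_difference(1.0, 700, 1.0, 300)

print(f"500/500 split: SE = {se_balanced:.4f}")
print(f"700/300 split: SE = {se_unbalanced:.4f}")
print(f"ratio = {se_unbalanced / se_balanced:.3f}")
```

Under these assumptions the $700/300$ split has a standard error roughly 9% larger than the $500/500$ split, so some power is lost, but the test itself remains valid.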

score 0 · Answer 2 · answered Feb 16 '21 at 17:25

If you are only going to calculate the average, you don't need to deal with the imbalance in your dataset. Just take the average for males and females separately. If some outliers change the average significantly, the median may be a better choice.
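As a rough sketch of the per-group summary, the snippet below computes the mean and median rank for each sex using only the standard library. The `records` list is made-up data for illustration (7 males and 3 females, to mirror the 70/30 split), and one deliberately extreme rank is included to show how the mean and median can diverge.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (sex, rank) pairs; not taken from the asker's dataset.
records = [
    ("M", 1), ("M", 4), ("M", 5), ("M", 7), ("M", 8), ("M", 10), ("M", 30),
    ("F", 2), ("F", 3), ("F", 6),
]

# Collect the ranks of each group separately.
ranks_by_sex = defaultdict(list)
for sex, rank in records:
    ranks_by_sex[sex].append(rank)

# Each summary uses only that group's own values and its own group size.
summary = {
    sex: {"n": len(ranks), "mean": mean(ranks), "median": median(ranks)}
    for sex, ranks in ranks_by_sex.items()
}

for sex, stats in summary.items():
    print(f"{sex}: n={stats['n']}, mean rank={stats['mean']:.2f}, median rank={stats['median']}")
```

Because each average is divided by its own group's count, having more males than females does not by itself push either average up or down.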

But if you want to develop a prediction model, you must over- or under-sample your training dataset before fitting the model.

Mehdi

Why do you have to over/under-sample? Is that so the accuracy performance metric isn’t biased towards a high number even though always guessing one group will give better than $50\%$ accuracy? – Dave Feb 16 '21 at 17:32
Without over/under-sampling, the model cannot predict the minority class well. You can find lots of articles on this, just Google it. – Mehdi Feb 16 '21 at 17:35
I like what our own Frank Harrell tweeted about this topic: https://mobile.twitter.com/f2harrell/status/1062424969366462473?lang=en. Stephan Kolassa has a nice post on here, too: https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he. – Dave Feb 16 '21 at 17:37
So use it, bro, if you get good results :). It may work for some problems. – Mehdi Feb 16 '21 at 17:42
