Should descriptive statistics on percentages data be weighted?

Question

I'm trying to put together descriptive statistics for a set of differently sized groups of students. For each group of students, let's say I know the number of students who are left-handed. I'd like to formulate what it means for a group to have a "higher than average" percentage of left-handedness.

What's a statistically sound way to define mean and standard deviation for this type of data?

Let's say the groups are A, B, and C (with 30, 50, and 100 students), and the percentages of left-handedness are 5%, 25%, and 55%. The groups represent the schools the students attend, and I'd like to have descriptive data about the left-handedness tendency of each school (e.g. school C is 1.5 standard deviations above the mean in its rate of left-handedness). If I combine all the groups together, I can calculate the overall rate of left-handedness, but then how do I get at something like a standard deviation across the groups?

Would it make sense to do a weighted average and weighted std calculation (using student group size for the weighting)? Or does it make more sense to take each percentage as its own data point and do a non-weighted mean and std? Is the overall rate of left-handedness (across all student groups) in this case the same thing as a weighted average?
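To make the two options concrete, here is a minimal sketch using the example numbers above (sizes 30, 50, 100 and rates 5%, 25%, 55%). The function names are made up for illustration, and the standard deviations use the population form (dividing by the number of groups or by the total weight), which is only one of several possible conventions.

```python
import math

# Example data from the question: group sizes and left-handedness rates.
sizes = [30, 50, 100]
rates = [0.05, 0.25, 0.55]


def unweighted_mean_sd(values):
    """Treat every group as one data point (population SD, divide by k)."""
    k = len(values)
    mean = sum(values) / k
    var = sum((v - mean) ** 2 for v in values) / k
    return mean, math.sqrt(var)


def weighted_mean_sd(values, weights):
    """Weight every group by its size (divide by the total weight)."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    var = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    return mean, math.sqrt(var)


# Overall rate when all students are pooled into one big group.
pooled_rate = sum(n * p for n, p in zip(sizes, rates)) / sum(sizes)

u_mean, u_sd = unweighted_mean_sd(rates)
w_mean, w_sd = weighted_mean_sd(rates, sizes)

print(f"pooled rate:     {pooled_rate:.4f}")
print(f"unweighted mean: {u_mean:.4f}, sd: {u_sd:.4f}")
print(f"weighted mean:   {w_mean:.4f}, sd: {w_sd:.4f}")
```

With size weights, the weighted mean is algebraically the same as the pooled overall rate, while the unweighted mean answers a different question (the rate of a "typical school" rather than a "typical student").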

All I'm trying to do at this point is describe the data in a statistically sound way. Any pointers to resources I can read up on to get a better understanding would also be appreciated. The real data has ~600 groups, with each group varying in size from 2 to 1500. Also, how can I derive a threshold for which groups are "too small" to consider and leave them out of the overall descriptive calculations?

I'm sure I'm missing something, but why not treat all groups as one big sample, compute the mean and the SD (correcting for finite sample size), and then look for outliers (correcting for sample size in each group)? — , Oct 27 '14 at 18:11
@barrycarter - right, and that calls for reformulating the question: Are you interested in descriptive stats for the entire population of groups or for each group? What is the reason for keeping those groups separate? As for thresholds, that depends on your data type and specific question... — katya, Oct 27 '14 at 21:14
I added a bit more color to answer the questions. I'm interested in being able to compare the rate of left-handedness in each group to some measure of overall lefthandedness. Each group is a different school. @barrycarter - how can I compute the SD if I treat all groups as one big sample? Then all I have is the overall percentage (e.g. X out of 180 students are left-handed). — ValAyal, Oct 28 '14 at 00:32
@ValAyal The standard deviation of a binomial distribution is Sqrt(n*p*(1-p)): http://www.dummies.com/how-to/content/how-to-find-the-mean-variance-and-standard-deviati.html — , Oct 28 '14 at 00:37
But you haven't said anything about sampling here. How did you select students within each group, or do the figures you have represent all students? If they represent all students, then no weighting is needed, since you have exact population values - not statistics. — StatsStudent, Jan 03 '19 at 23:56
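One way to read the "one big sample" suggestion in the comments is sketched below, under stated assumptions: pool all students to get an overall rate, then measure each group's distance from it in units of the binomial standard error for a group of that size. Note that Sqrt(n*p*(1-p)) is the standard deviation of the count; for a proportion it becomes sqrt(p*(1-p)/n). The helper name `proportion_se` is hypothetical, and the normal approximation behind such z-scores is rough for very small groups.

```python
import math

# Example data from the question: group sizes and left-handedness rates.
sizes = [30, 50, 100]
rates = [0.05, 0.25, 0.55]

# Overall rate with all groups pooled into one sample.
pooled = sum(n * p for n, p in zip(sizes, rates)) / sum(sizes)


def proportion_se(p, n):
    """Binomial standard error of a proportion for a group of size n."""
    return math.sqrt(p * (1 - p) / n)


# Distance of each group's rate from the pooled rate, scaled by the
# standard error expected for a group of that size under the pooled rate.
z_scores = [
    (p_i - pooled) / proportion_se(pooled, n)
    for n, p_i in zip(sizes, rates)
]

for label, n, z in zip("ABC", sizes, z_scores):
    print(f"group {label} (n={n}): z = {z:.2f}")
```

Because the standard error shrinks with group size, the same raw gap counts for less in a small group than in a large one, which bears on the question of when a group is "too small".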

score 1 · Answer 1 · answered Jan 13 '20 at 22:28

You have a lot of groups, and for each, a binary variable $Y$: left-handed or not. Let $\hat{p}$ be the overall proportion of left-handedness, and $\hat{p}_i$ the proportion for group $i$. Then a good descriptive measure could be the rate ratio (or risk ratio) $$ rr_i=\frac{\hat{p}_i}{\hat{p}} $$ This could be plotted for all the groups, along a line, relative to the reference value 1. The plot would be more useful with confidence intervals added for the risk ratios. Those can be calculated via a binomial regression, but with the log link function in place of the more usual logistic link. See for instance "Why isn't it 'wrong' to use a log link instead of a logit one when doing GLM with a binomial family?".
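As a minimal sketch, the point estimates $rr_i$ for the example in the question (sizes 30, 50, 100; rates 5%, 25%, 55%) could be computed like this. The function name `rate_ratios` is made up for illustration, and the sketch deliberately leaves out confidence intervals, which would need the log-link binomial regression mentioned above.

```python
# Example data from the question: group sizes and left-handedness rates.
sizes = [30, 50, 100]
rates = [0.05, 0.25, 0.55]


def rate_ratios(rates, sizes):
    """Return (overall proportion, per-group ratios p_i / p_overall)."""
    overall = sum(n * p for n, p in zip(sizes, rates)) / sum(sizes)
    return overall, [p / overall for p in rates]


overall, ratios = rate_ratios(rates, sizes)

for label, rr in zip("ABC", ratios):
    # Values above 1 mean the group is above the overall rate.
    side = "above" if rr > 1 else "below"
    print(f"group {label}: rr = {rr:.3f} ({side} the overall rate)")
```

A ratio of 1 marks a group whose rate matches the overall rate, so the printed values can be read directly against that reference.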
