How to calculate mean and variance correctly when data is in frequency "bins"?

Question

Education data was just released, providing counts for test results in bins corresponding to letter grades. The bins are given by the partition $m=[m_0,m_1,...,m_k]$ of the numerical grade scale, the frequency within each bin by $z_j=\#\{x_i|m_{j-1}\leq x_i<m_j\},\ j=1,...,k$, where $x_i, i=1,...,n$ are the individual student scores and $z_j$ is the count of observations in each bin. Only $m$ and $(z_1,...,z_k)$ are reported; the raw scores $\{x_1,...,x_n\}$ are not released.

Of course one could calculate the overall mean by taking each bin's midpoint as its mean, $\bar\mu_j=\frac{m_{j-1}+m_j}{2}$, and then aggregating $$\bar\mu=\frac1n \sum_{j=1}^k z_j \bar\mu_j.$$ This implicitly assumes that the data are symmetrically distributed (e.g., uniform) within each bin, which is unreasonable given the unimodal nature of the overall distribution.
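
For concreteness, here is a minimal sketch of that midpoint calculation, taking each bin's midpoint $(m_{j-1}+m_j)/2$ as its mean. The function name `binned_mean` and the example edges and counts are made up for illustration; they are not the released data.

```python
def binned_mean(edges, counts):
    """Approximate mean from bin edges m_0..m_k and counts z_1..z_k,
    placing every observation at the midpoint of its bin."""
    if len(edges) != len(counts) + 1:
        raise ValueError("need k+1 edges for k bins")
    n = sum(counts)
    # Midpoint of each bin [m_{j-1}, m_j)
    mids = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]
    return sum(z * mid for z, mid in zip(counts, mids)) / n


# Hypothetical letter-grade bins of unequal width (illustrative only)
edges = [0, 60, 70, 80, 90, 100]
counts = [5, 10, 20, 10, 5]
print(binned_mean(edges, counts))
```

Nothing here depends on the bins having equal width; only the midpoint assumption matters.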

A similar naïve formula for the variance, $$\text{Var}(x)=\frac1{n-1} \sum_{j=1}^k z_j(\bar\mu_j-\bar\mu)^2,$$ would underestimate the true variance. Assuming a uniform distribution within each bin could be an improvement, but it still ignores the fact that the probability mass within each bin is likely higher at the boundary closer to the overall mean and lower at the boundary further away from it.
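
To make the uniform-within-bin idea concrete: by the law of total variance, the overall variance is the spread of the bin means plus the average within-bin variance, and a uniform distribution on a bin of width $w_j=m_j-m_{j-1}$ has variance $w_j^2/12$. The sketch below computes both pieces under that assumption. It handles unequal widths, uses the divisor $n$ rather than $n-1$ (negligible for large $n$), and the function name and example numbers are hypothetical.

```python
def binned_variance_uniform(edges, counts):
    """Variance under the assumption that scores are uniform within each bin.

    Returns (between, within, total), where
      between = (1/n) * sum z_j * (midpoint_j - mean)^2
      within  = (1/n) * sum z_j * width_j^2 / 12
    """
    if len(edges) != len(counts) + 1:
        raise ValueError("need k+1 edges for k bins")
    n = sum(counts)
    bins = list(zip(edges[:-1], edges[1:], counts))
    mean = sum(z * (lo + hi) / 2 for lo, hi, z in bins) / n
    between = sum(z * ((lo + hi) / 2 - mean) ** 2 for lo, hi, z in bins) / n
    within = sum(z * (hi - lo) ** 2 / 12 for lo, hi, z in bins) / n
    return between, within, between + within


# Hypothetical letter-grade bins of unequal width (illustrative only)
edges = [0, 60, 70, 80, 90, 100]
counts = [5, 10, 20, 10, 5]
between, within, total = binned_variance_uniform(edges, counts)
print(between, within, total)
```

The `within` term is dominated by the widest bin, so with bins of varied width the result is sensitive to how plausible the uniform assumption is in exactly that bin.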

My hunch is that it would require either a parametric assumption for the overall distribution (which, given the data, I am reluctant to make) or estimating a kernel of some sort. Seems to me this is a fairly standard problem. Does someone have a solution?
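
For comparison, this is roughly what the parametric route would look like if one were willing to assume a normal distribution, which is exactly the assumption I am reluctant to make. The sketch maximises the likelihood of the bin counts with a crude pattern search so that it needs only the standard library. The function names, starting values and example data are hypothetical, and any mass the normal places outside $[m_0, m_k]$ is simply ignored.

```python
import math
from statistics import NormalDist


def binned_normal_loglik(mu, sigma, edges, counts):
    """Log-likelihood of the bin counts if scores were Normal(mu, sigma)."""
    dist = NormalDist(mu, sigma)
    total = 0.0
    for lo, hi, z in zip(edges[:-1], edges[1:], counts):
        p = dist.cdf(hi) - dist.cdf(lo)  # probability of landing in the bin
        total += z * math.log(max(p, 1e-300))  # guard against log(0)
    return total


def fit_binned_normal(edges, counts, mu0, sigma0, tol=1e-6):
    """Crude pattern search for (mu, sigma) maximising the log-likelihood."""
    mu, sigma = mu0, sigma0
    best = binned_normal_loglik(mu, sigma, edges, counts)
    step = sigma0 / 2
    while step > tol:
        improved = False
        for d_mu, d_sigma in ((step, 0), (-step, 0), (0, step), (0, -step)):
            cand_mu, cand_sigma = mu + d_mu, sigma + d_sigma
            if cand_sigma <= 0:
                continue  # sigma must stay positive
            ll = binned_normal_loglik(cand_mu, cand_sigma, edges, counts)
            if ll > best:
                mu, sigma, best, improved = cand_mu, cand_sigma, ll, True
        if not improved:
            step /= 2  # no direction helped: refine the search
    return mu, sigma, best


# Hypothetical letter-grade bins of unequal width (illustrative only)
edges = [0, 60, 70, 80, 90, 100]
counts = [5, 10, 20, 10, 5]
mu_hat, sigma_hat, ll_hat = fit_binned_normal(edges, counts, mu0=72.5, sigma0=17.0)
print(mu_hat, sigma_hat ** 2)
```

In practice one would hand the log-likelihood to a proper optimiser; the search above is only meant to show the shape of the calculation.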

Thanks for the [cross-reference](http://stats.stackexchange.com/questions/60256/standard-deviation-of-binned-observations). That question, and its answer, assume equal-sized bins. Unfortunately the data in this question use bins of varied width. Sheppard's correction assumes uniform distribution within a bin, which I suggested above as a first approximation to a solution, though a potentially flawed one. — Sven, Dec 15 '16 at 12:06
The maximum likelihood solution suggested in the other question is potentially a parametric solution to the problem. I was hoping there is a non-parametric solution, as the data may be non-Normal. Simulation may, however, show that the error is too small to matter. — Sven, Dec 15 '16 at 12:10
