In the 1972 paper "Some Graphic and Semigraphic Displays", Tukey says that

“[w]e all ought to be aware of $h^2 / 12$, the approximate increase in variance due to grouping in cells of width $h$. This reaches $2\%$ of the initial variance when $h=\sigma/2$, thus supporting the classical suggestion that frequency distributions with $10$ to $20$ occupied cells are adequate for most purposes”.

I am unsure how these figures were derived. Ideally, an answer would show the variance before and after grouping and then show how the numbers above follow.

Single Malt
  • Look into Sheppard's correction; see https://stats.stackexchange.com/questions/60256/standard-deviation-of-binned-observations – kjetil b halvorsen Apr 21 '21 at 02:26
  • It does: knowing about Sheppard's correction, together with the excellent answer by @whuber, answers the question. I may edit the question, as I am still unsure whether the only use case for this correction is when you have a binned data set without the raw values and want a better estimate of the variance. Further, it is not clear to me how the underlying distribution affects this, for example whether it is normal or not. – Single Malt Apr 21 '21 at 16:58
  • You are correct: Sheppard's correction was traditionally used to make improved estimates with binned data. Working with such data summaries instead of the raw data is a huge time saver when computing by hand. One can develop analogous corrections for other distributions, but I think that was rarely done. The sensitivity is in the tails: assuming Normality means the corrections are not extrapolating much beyond the extreme bins. Assuming any longer-tailed distribution makes the results depend, perhaps a lot, on that assumption. – whuber Apr 21 '21 at 18:01
  • I can understand why it would help when working by hand and when one is willing to assume normality. When the raw data are not available, a bootstrap calculation of the variance might be another alternative, as it may rely less on the normality assumption. I may yet recast this question to ask for simulations of a few different distributions, to help understand how much violations of the normality assumption matter. – Single Malt Apr 21 '21 at 19:23
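
Picking up the simulation idea from the last comment, here is a minimal sketch (not taken from the thread) that checks the quoted figure numerically. The usual heuristic behind Sheppard's correction treats the offset of an observation from its cell midpoint as roughly uniform on $(-h/2, h/2)$, which has variance $h^2/12$; at $h=\sigma/2$ that is $\sigma^2/48 \approx 2.1\%$ of $\sigma^2$. The helper names `variance`, `group_to_midpoints`, `relative_variance_increase` and `occupied_cells` are hypothetical, and the sample size, the seed, the cell origin at 0 and the choice of a Laplace sample as a longer-tailed comparison are all arbitrary assumptions.

```python
import math
import random


def variance(xs):
    """Population variance of a list of floats."""
    n = len(xs)
    mean = math.fsum(xs) / n
    return math.fsum((x - mean) ** 2 for x in xs) / n


def group_to_midpoints(xs, h, origin=0.0):
    """Replace each value by the midpoint of its cell of width h.

    Cells are [origin + k*h, origin + (k+1)*h) for integer k.
    """
    return [origin + (math.floor((x - origin) / h) + 0.5) * h for x in xs]


def relative_variance_increase(xs, h):
    """(Var(grouped) - Var(raw)) / Var(raw) for cell width h."""
    raw = variance(xs)
    grouped = variance(group_to_midpoints(xs, h))
    return (grouped - raw) / raw


def occupied_cells(xs, h, origin=0.0):
    """Number of distinct cells of width h that contain at least one value."""
    return len({math.floor((x - origin) / h) for x in xs})


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000             # arbitrary sample size

samples = {
    "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
    # The difference of two unit exponentials is a Laplace variate (variance 2).
    "laplace": [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)],
}

results = {}
for name, xs in samples.items():
    sigma = math.sqrt(variance(xs))
    h = sigma / 2  # the cell width from the Tukey quote
    results[name] = relative_variance_increase(xs, h)
    print(f"{name:8s} relative increase = {results[name]:.4f} "
          f"(h^2/12 predicts {1 / 48:.4f}), "
          f"occupied cells = {occupied_cells(xs, h)}")
```

For the normal sample the simulated increase should land close to $1/48$. How close other distributions come depends on how smoothly the density behaves, especially in the tails, which is the sensitivity whuber describes above. The number of occupied cells also depends on the sample size, so it is only a rough guide to the "10 to 20" remark.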
