
I've been reading about the problem of choosing an appropriate binwidth for histograms, and here's my broad-level understanding so far:

If we have $n$ data points, we assume that they're realizations of $n$ random variables following an unknown distribution $f$. Histograms are essentially density estimators that attempt to derive an estimate $\hat{f}$ of the underlying distribution. $\hat{f}$ depends on our choice of binwidth $h$, and an appropriate choice would give us an estimate that is close to the original distribution. The "closeness" is characterized by the integrated mean squared error (IMSE).
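
For concreteness, the criterion as I understand it is (my notation, so apologies if it's slightly non-standard):

$$\mathrm{IMSE}(h) = \mathbb{E}\left[\int \big(\hat{f}_h(x) - f(x)\big)^2 \, dx\right],$$

where $\hat{f}_h$ is the histogram estimate built with binwidth $h$, and the expectation is taken over the sample.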

There are a few rules for binning given on Wikipedia, such as the Freedman–Diaconis rule, Sturges' rule, and so on. My question is: how were these rules derived in the first place, given that we don't know $f$? If we don't know $f$, we can't explicitly calculate the IMSE, and no optimization can be done. Were these rules applied to simulated data sets generated from a wide variety of probability distributions, and selected because they worked in most of those cases?
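
To make sure I'm referring to the same rules, here's a quick sketch of how I understand two of them are computed (formulas as given on Wikipedia; I'm using numpy purely as a cross-check, since it appears to implement both):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(500)   # toy sample; any 1-D data would do
n = len(data)

# Sturges' rule: number of bins k = ceil(log2(n)) + 1
k_sturges = int(np.ceil(np.log2(n))) + 1

# Freedman-Diaconis rule: binwidth h = 2 * IQR * n^(-1/3)
q75, q25 = np.percentile(data, [75, 25])
h_fd = 2 * (q75 - q25) * n ** (-1 / 3)
k_fd = int(np.ceil((data.max() - data.min()) / h_fd))

print(f"Sturges:           {k_sturges} bins")
print(f"Freedman-Diaconis: h = {h_fd:.3f}, about {k_fd} bins")

# numpy's built-in bin selectors, for comparison
print(len(np.histogram_bin_edges(data, bins="sturges")) - 1)
print(len(np.histogram_bin_edges(data, bins="fd")) - 1)
```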

I'm not looking for exact derivations at this point, just the paradigm.

Shirish Kulhari
  • Some discussion of Sturges' rule and its connection to Doane's formula (and pointers to a paper by Rob Hyndman discussing a problem with them) is [here](https://stats.stackexchange.com/questions/55134/doanes-formula-for-histogram-binning/55205). Discussion of some other rules can be found on site as well. Try some searches for [Freedman-Diaconis](https://stats.stackexchange.com/search?q=Freedman-Diaconis) for example – Glen_b May 30 '17 at 11:53

0 Answers