
Wikipedia reports that under the Freedman and Diaconis rule, the optimal number of bins in a histogram, $k$, should grow as

$$k\sim n^{1/3}$$

where $n$ is the sample size.

However, if you look at the nclass.FD function in R, which implements this rule, at least with Gaussian data and when $\log(n)\in(8,16)$, the number of bins seems to grow at a faster rate than $n^{1/3}$, closer to $n^{1-\sqrt{1/3}}$ (in fact, the best fit suggests $k\approx n^{0.4}$). What is the rationale for this difference?


Edit: more info:

[Figure: number of classes returned by nclass.FD plotted against sample size, with the OLS fit described below]

The line is the OLS fit, with intercept 0.429 and slope 0.4. In each case the data (x) were generated from a standard Gaussian and fed into nclass.FD. The plot shows the size (length) of the vector against the optimal number of classes returned by the nclass.FD function.
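
For reference, here is a minimal sketch, not the exact code behind the plot above, of how such an experiment can be run in R (the grid of sample sizes and the seed are my own choices):

set.seed(1)
n <- round(exp(seq(8, 16, length.out = 20)))   # log(n) between 8 and 16
k <- sapply(n, function(m) grDevices::nclass.FD(rnorm(m)))
coef(lm(log(k) ~ log(n)))   # slope comes out noticeably above 1/3, around 0.37-0.4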

Quoting from Wikipedia:

A good reason why the number of bins should be proportional to $n^{1/3}$ is the following: suppose that the data are obtained as n independent realizations of a bounded probability distribution with smooth density. Then the histogram remains equally »rugged« as n tends to infinity. If $s$ is the »width« of the distribution (e. g., the standard deviation or the inter-quartile range), then the number of units in a bin (the frequency) is of order $n h/s$ and the relative standard error is of order $\sqrt{s/(n h)}$. Comparing to the next bin, the relative change of the frequency is of order $h/s$ provided that the derivative of the density is non-zero. These two are of the same order if $h$ is of order $s/n^{1/3}$, so that $k$ is of order $n^{1/3}$.

The Freedman–Diaconis rule is: $$h=2\frac{\operatorname{IQR}(x)}{n^{1/3}}$$

user603
  • As I recall bin number is proportional to $n^{1/3}$, not as reported above. – Nick Cox Mar 26 '15 at 01:14
  • It's late in the day for me to check literature, but your formula rings no bells with me. – Nick Cox Mar 26 '15 at 01:33
  • Surely these are nothing more than reasonable rules-of-thumb, and therefore a discrepancy is of no theoretical import. Is there more to it than that? – Michael Lew Mar 26 '15 at 02:04
  • This is why I love the empirical CDF. The histogram is the empirical PDF and bad bin size can wreck it. The CDF, however, has better information, and is not dependent on bin size. – EngrStudent Mar 26 '15 at 04:26
  • @MichaelLew: have a look at the Freedman and Diaconis paper. The meaning of rule of thumb is context dependent! – user603 Mar 26 '15 at 08:52
  • You're not plotting $h$; you seem to be plotting $k = \text{Range }n^{1/3}/(2\text{ IQR})$ (rounded up). Unless you are standardizing your datasets to a constant value of $\text{Range}/\text{IQR}$, then this plot is confounding changes in the range with changes in $k$ (presumably the IQR will be fairly stable). So exactly what are you doing to generate this plot? – whuber Mar 26 '15 at 15:29
  • @whuber: yes that seems to be what is causing the difference: I forgot to adjust for the increase in the range. – user603 Mar 27 '15 at 10:39

1 Answer


The reason comes from the fact that the histogram function is expected to include all the data, so it must span the range of the data.

The Freedman-Diaconis rule gives a formula for the width of the bins.

The nclass.FD function, on the other hand, returns a number of bins.

The relationship between the number of bins and the width of the bins depends on the range of the data.

With Gaussian data, the expected range increases with $n$.

Here's the function:

> nclass.FD
function (x) 
{
    h <- stats::IQR(x)
    if (h == 0) 
        h <- stats::mad(x, constant = 2)
    if (h > 0) 
        ceiling(diff(range(x))/(2 * h * length(x)^(-1/3)))
    else 1L
}
<bytecode: 0x086e6938>
<environment: namespace:grDevices>

diff(range(x)) is the range of the data.

So as we see, it divides the range of the data by the FD formula for bin width (and rounds up) to get the number of bins.
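
As an illustration (a small sketch of my own, not part of the original answer), one can confirm that reproducing this division by hand matches nclass.FD, and that the range/IQR factor it introduces grows with $n$ for Gaussian data:

set.seed(2)
sizes <- c(1e3, 1e4, 1e5, 1e6)
out <- t(sapply(sizes, function(n) {
    x <- rnorm(n)
    h <- 2 * stats::IQR(x) * n^(-1/3)                   # FD bin width
    c(n = n,
      range.over.IQR = diff(range(x)) / stats::IQR(x),  # grows with n
      bins = ceiling(diff(range(x)) / h),               # range / bin width
      nclass.FD = grDevices::nclass.FD(x))              # matches "bins"
}))
out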

It seems I could have been clearer, so here's a more detailed explanation:
The actual Freedman-Diaconis rule is not a rule for the number of bins, but for the bin width. By their analysis, the bin width should be proportional to $n^{-1/3}$. Since the total width of the histogram must be closely related to the sample range (it may be a bit wider, because of rounding up to nice numbers), and the expected range changes with $n$, the number of bins is not quite inversely proportional to the bin width, but must increase faster than that. So the number of bins should not grow as $n^{1/3}$: it should stay close to that rate, but grow a little faster, because of the way the range comes into it.

Looking at data from Tippett's 1925 tables[1], the expected range in standard normal samples seems to grow quite slowly with $n$, though -- slower even than $\log(n)$:

[Figure: growth of the expected range of standard normal samples with $n$, based on Tippett's tables]

(Indeed, amoeba points out in comments below that it should be proportional, or nearly so, to $\sqrt{\log(n)}$, which grows more slowly than your analysis in the question seems to suggest. This makes me wonder whether there's some other issue coming in, but I haven't investigated whether this range effect fully explains your data.)

A quick look at Tippett's numbers (which go up to $n=1000$) suggests that the expected range in a Gaussian is very close to linear in $\sqrt{\log(n)}$ over $10\leq n\leq 1000$, though it does not appear to be actually proportional over this range.

[Figure: expected range of standard normal samples plotted against $\sqrt{\log(n)}$, based on Tippett's tables, showing a near-linear relationship]
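
As a rough check of my own (a simulation sketch rather than Tippett's numerical integration; the sample sizes and replication count are arbitrary), the estimated expected range can be regressed on $\sqrt{\log(n)}$:

set.seed(3)
sizes <- c(10, 30, 100, 300, 1000)
exp.range <- sapply(sizes, function(n)
    mean(replicate(2000, diff(range(rnorm(n))))))   # Monte Carlo estimate of E[range]
lm(exp.range ~ sqrt(log(sizes)))   # close to linear, but with a nonzero intercept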

[1]: L. H. C. Tippett (1925). "On the Extreme Individuals and the Range of Samples Taken from a Normal Population". Biometrika 17 (3/4): 364–387

Glen_b
  • So, the number of bins does scale as $\approx n^{0.4}$ (with Gaussian data)? – user603 Mar 26 '15 at 04:04
  • Not really, no. More details added. – Glen_b Mar 26 '15 at 04:34
  • +1. I understand the arguments as to why the number of bins should grow at $n^{1/3}$. What I don't understand is what causes the difference between that number and the actual one. I mean, the R code is literally implementing the formula, and the tests I'm doing are as close to an optimal setting (large $n$, Gaussian data) as one would get. – user603 Mar 26 '15 at 15:05
  • The actual Freedman-Diaconis rule is **not** a rule for the number of bins, but for the bin-width. By their analysis, the bin width should be proportional to $n^{-1/3}$. Since the total width of the histogram must be closely related to the sample range (it may be a bit wider, because of rounding up to nice numbers), and the expected range changes with $n$, the number of bins is not quite inversely proportional to bin-width. So the number of bins *should not* grow as $n^{1/3}$ - at least not quite, because of the way the range comes into it. – Glen_b Mar 26 '15 at 17:55
  • The reasoning you quote from Wikipedia in your question doesn't take the effect of the sample range into account. – Glen_b Mar 26 '15 at 18:02
  • I think this solves it. – user603 Mar 27 '15 at 10:35
  • If I apply [this math.SE post](http://math.stackexchange.com/questions/89030/expectation-of-the-maximum-of-gaussian-random-variables) correctly, the range should grow as $\sqrt{\log(n)}$. – amoeba Mar 27 '15 at 11:13
  • @amoeba thanks for that, very interesting. By the look of it, it should be a tad smaller than $\sqrt{\log(n)}$, but the difference will be so small it's probably not worth worrying over. – Glen_b Mar 27 '15 at 11:39
  • @amoeba Actually, I just checked against Tippett's values from integration, and the values are really close to $c\sqrt{\log(n)}$. – Glen_b Mar 27 '15 at 11:46
  • And $n^{1/3}\sqrt{\log(n)}$ when $\log(n)$ is between 4 and 12 does look pretty close to $n^{0.4}$ observed by @user603 -- linear loglog fit gives me slope of $0.37$. – amoeba Mar 27 '15 at 12:00
  • @amoeba I've seen an asymptotic formula for the expected range in Gaussians in a paper before that I've been looking for since I first posted this answer, but couldn't locate it -- which is why I then went back to Tippett to show it was slower than $n^{0.4-1/3}$ (i.e. $n^{1/15}$). I wish I could find the paper, but it's been nearly 30 years since I read it and I haven't come up with the right search terms yet. With your link there's a suitable formula, fortunately. – Glen_b Mar 27 '15 at 22:01
  • May I ask why you are looking for values of expected range in the 1925 table, when it should be very easy (probably a couple of lines in R) to obtain these numbers by simulation, including larger values of $n$? – amoeba Mar 28 '15 at 00:49
  • @amoeba Tippett computed the values by numerical integration. They should be accurate to something near 7 figures, I think. To get that accuracy from simulation would require *very* large numbers of samples ... and I could easily make an error, while Tippett's values are well-tested. I have no objection to simulation (I use it all the time), but simulation isn't always the best choice. I might be better off replicating the numerical integration, perhaps, but I probably wouldn't get more accuracy unless I went to a great deal of effort (imagine the effort of doing 1000 values by hand, back in 1925). – Glen_b Mar 28 '15 at 00:52
  • @amoeba scratch that... more like 5 to 6 figures. I should have double checked the table and the discussion before I wrote the comment. At that sort of accuracy, simulation might be sensible. – Glen_b Mar 28 '15 at 01:01
  • @amoeba Oddly I had previously identified the $\sqrt{\log n}$ effect in an answer a couple of years before I wrote this one... but completely forgot about that. http://stats.stackexchange.com/questions/80444/sample-range-to-estimate-stdev/80449#80449 – Glen_b Aug 18 '16 at 03:57
  • What to do if the mad function returns 0 also? – Robur_131 Mar 22 '20 at 18:31
  • MAD doesn't enter into the Freedman-Diaconis rule. If you have a question about some histogram rule that does use MAD you will need to post a question. – Glen_b Mar 22 '20 at 22:11