
... instead of e.g. the popular Equal-Width-Histograms.

Additional question: What is a good, robust rule of thumb for calculating the number of bins for equal-frequency histograms (like the Freedman-Diaconis rule for equal-width histograms)?

mlwida
  • I really wonder what an equal-frequency histogram might look like. I have the intuition that it is really flat. Can you give an example? – Henrik Jan 31 '11 at 16:58
  • @Henrik: See e.g. this question (http://stats.stackexchange.com/questions/5573/how-to-build-an-equilibrated-histogram). Yes it is flat, so it clearly cannot be used for density estimation ;). However, since the equal-width approach is so generic, it seems that it can be applied in every situation where equal-freq can be applied. So when should one favor equal-freq? – mlwida Jan 31 '11 at 17:09
  • @Henrik No, an equal frequency histogram generally is *not* flat. Histograms are commonly confused with bar charts, which display values by means of the *heights* of bars. However, by definition, a histogram displays frequencies by means of *areas*. Consider (*e.g.*) the data {0,1,2,4,8,16,32,64}, to be shown in the range [0,100] with two bins. The break for an equal-frequency histogram has to be between 4 and 8. If we put it at 6, the height of the left bar *multiplied* by (6-0) = 6 equals 4, whence the height is 4/6. The height of the right bar equals 4/(100-6) = 4/94. Not flat at all! – whuber Jan 31 '11 at 19:29
  • (Continued) See a Wikipedia example of a variable-width histogram at http://en.wikipedia.org/wiki/File:Travel_time_histogram_total_n_Stata.png , which is an illustration for its article on "Histogram." – whuber Jan 31 '11 at 19:31
  • @steffen Your second question has already been asked and answered at http://stats.stackexchange.com/q/798/919 . More formulas appear at http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width . – whuber Jan 31 '11 at 19:32
  • @whuber wow. It is weird to see that one still mixes up pretty basic things. Thanks a lot for pointing my mistake out! – Henrik Jan 31 '11 at 20:50
  • @whuber: Thank you for your explanation. Your answer to the second question indicates that these rules do not necessarily require an equal-width histogram. I did not know that. – mlwida Feb 01 '11 at 07:13
  • @whuber: Although it seems "strange" that one can use FD to calculate bin-width, then the number of bins and then the equal-freq-bin-width. – mlwida Feb 01 '11 at 08:37
  • @Steffen Sorry, I was mistaken. Those rules are for equal-width histograms. For equal-frequency histograms the theory is different, because you determine the *area* of each bar in advance. Thus, variation in the area is proportional to the square root of the (common) bin population. Choosing that population is therefore a tradeoff between horizontal precision (number of bars) and areal precision; where to come down in that tradeoff is your decision. – whuber Feb 01 '11 at 14:45
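
The arithmetic in whuber's two-bin example above can be checked with a short sketch. The function name `equal_frequency_histogram` and the choice of putting each break halfway between neighbouring sorted values are assumptions made for illustration (the comment only requires the break to lie between 4 and 8). Bar heights are counts divided by bin widths, so each bar's *area* equals the common bin population.

```python
def equal_frequency_histogram(data, n_bins, lo, hi):
    """Return (edges, heights) for an equal-frequency histogram on [lo, hi].

    Hypothetical helper: assumes len(data) is divisible by n_bins and that
    the values around each break are distinct. Breaks are placed at the
    midpoint between neighbouring order statistics, which is one arbitrary
    choice among many valid ones.
    """
    xs = sorted(data)
    per_bin = len(xs) // n_bins
    edges = [lo]
    for k in range(1, n_bins):
        i = k * per_bin
        edges.append((xs[i - 1] + xs[i]) / 2)
    edges.append(hi)
    # Height = count / width, so that area (height * width) equals the count.
    heights = [per_bin / (b - a) for a, b in zip(edges, edges[1:])]
    return edges, heights


edges, heights = equal_frequency_histogram([0, 1, 2, 4, 8, 16, 32, 64], 2, 0, 100)
print(edges)    # break lands at 6, as in the comment
print(heights)  # 4/6 and 4/94: equal areas, very unequal heights
```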

2 Answers


This is not a proper or complete answer, but two observations from my personal experience:

  • An equal-frequency histogram will hide outliers (I've seen them in long, low bins).

  • The heights of the individual bins in an equal-frequency histogram seem more stable than in an equal-width histogram.

I use equal-frequency histograms mainly for exploratory analysis. They give me a better intuitive feel for the shape of the distribution than an equal-width histogram.

I am now trying them in an application where I use a function of a histogram of the data as a distance metric between two very skewed distributions. An equal-width histogram would have almost all of the samples in one bin, whereas an equal-frequency histogram with the same number of bins will have many narrow bins in that area. Intuitively, if we consider the height of a bin as a variable, the equal-frequency histogram will better spread the available distribution information among the variables.
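
A minimal sketch of that contrast, using made-up skewed data (powers of two) rather than anything from my application. The helper names `equal_width_counts` and `equal_frequency_edges` are hypothetical, and taking the edges directly from the sorted sample is just one simple way to get roughly equal counts per bin.

```python
def equal_width_counts(data, n_bins):
    """Count samples per bin when [min, max] is split into equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum value would fall on the last edge; clamp it into the last bin.
        i = min(int((x - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


def equal_frequency_edges(data, n_bins):
    """Pick bin edges from the sorted sample so each bin holds about n/n_bins points."""
    xs = sorted(data)
    n = len(xs)
    return [xs[min(k * n // n_bins, n - 1)] for k in range(n_bins + 1)]


data = [2 ** k for k in range(16)]     # illustrative skewed sample: 1, 2, 4, ..., 32768
print(equal_width_counts(data, 4))     # [14, 1, 0, 1]: almost everything in the first bin
print(equal_frequency_edges(data, 4))  # [1, 16, 256, 4096, 32768]: narrow bins where data are dense
```

With the equal-width binning, three of the four bin heights carry almost no information, while the equal-frequency edges place most of the bins in the region where the samples actually are.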

Eponymous
  • (+1) thank you for this helpful reply. It seems you have used them regularly. I am curious when and why you have preferred to use them (instead of e.g. equal-width). – mlwida Jul 16 '12 at 12:07
  • I use them mainly for exploratory analysis. They give me a better intuitive feel for the shape of the distribution than an equal-width histogram. I am now trying them in an application where I use a function of a histogram of the data as a distance metric between two very skewed distributions. An equal-width histogram would have almost all of the samples in one bin, whereas an equal-frequency histogram will have many narrow bins in that area. Intuitively the equal-frequency histogram will better spread the available distribution information among the variables. – Eponymous Jul 16 '12 at 16:25
  • This sounds reasonable, thank you again ! Could you be so kind to merge your last comment with your answer ? I'd like to accept it. – mlwida Jul 17 '12 at 07:07
  • There you go. My comment is now merged into the answer. – Eponymous Jul 19 '12 at 15:40

Equi-depth histograms are a solution to the problem of quantization (mapping continuous values to discrete values).

For finding the best number of bins, I think it really depends on what you are trying to do with the histogram. In general, I think it would be best to ensure that your error measure of choice is below some threshold (e.g., sum of squared errors < THRESH) and bin the values in that manner.

Alternatively, the number of bins can be passed in as a parameter (if you're concerned about the space consumption of the histogram).
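
One possible reading of the threshold idea, as a rough sketch: treat each equal-frequency bin as quantizing its values to the bin mean, and increase the number of bins until the sum of squared errors drops below the threshold. The names `sse_for_bins` and `smallest_bin_count`, the use of the bin mean as the representative value, and the linear search are all assumptions for illustration, not an established rule.

```python
def sse_for_bins(xs_sorted, n_bins):
    """Sum of squared errors when each equal-frequency bin is replaced by its mean."""
    n = len(xs_sorted)
    total = 0.0
    for k in range(n_bins):
        # Consecutive slices of the sorted data give (roughly) equal-count bins.
        chunk = xs_sorted[k * n // n_bins:(k + 1) * n // n_bins]
        if not chunk:
            continue
        mean = sum(chunk) / len(chunk)
        total += sum((x - mean) ** 2 for x in chunk)
    return total


def smallest_bin_count(data, thresh):
    """Smallest number of equal-frequency bins with SSE < thresh (at most one bin per point)."""
    xs = sorted(data)
    for n_bins in range(1, len(xs) + 1):
        if sse_for_bins(xs, n_bins) < thresh:
            return n_bins
    return len(xs)


data = [0, 1, 2, 4, 8, 16, 32, 64]
print(sse_for_bins(sorted(data), 2))   # 1848.75
print(smallest_bin_count(data, 2000))  # 2
```

The threshold itself still has to come from the application, so this moves the choice from "how many bins" to "how much quantization error is acceptable".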

Nick
  • Thank you for the response; however, I see no value in it: 1. As far as I can see, quantization is not focused (primarily or solely) on equal-freq histograms. 2. Determining the number of bins by hand or by automatic optimization (via sum of squared errors) is an approach that can be applied anywhere. – mlwida Feb 14 '11 at 09:10
  • "No value" was a little too harsh. I meant "no value" for the specific nature of my question, which is focused on equal-freq histograms (and rules of thumb for them). – mlwida Feb 14 '11 at 11:19