The assumption is that we are unable to access the underlying data of the original histograms, but we do know the number of observations in each histogram. If the original histograms all have the same range and the same number of bins, we can simply add the frequencies of corresponding bins across the histograms. However, the story is more complicated when the ranges and the numbers of bins differ.

Are there established methods for merging histograms with different ranges and different numbers of bins?

Is there any existing research interest in this topic?

Extra question: After merging the histograms, how can the uncertainty of the new histogram be quantified?

Snowfish

1 Answer

The histogram is a density estimator! Assuming you have expressed the two histograms in this way, that is, with the y-axis in density units (density is probability per unit along the x-axis), the combined histogram can be expressed as a mixture density of the two given histograms. Let $f_1(x), f_2(x)$ be the two given histograms, with sample sizes $n_1, n_2$, and let $n = n_1 + n_2$. Then the combined histogram $f(x)$ is $$ f(x) = \frac{n_1}{n} f_1(x) + \frac{n_2}{n} f_2(x) $$ Others will have to chime in on the question about uncertainty.
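As a rough illustration of the mixture formula, here is a minimal Python sketch. It assumes each histogram is given as a list of bin edges plus a list of counts, and it evaluates the mixture on the union of the two sets of edges, so the result generally has variable bin widths. The function names (`to_density`, `density_at`, `merge_histograms`) and the example numbers are made up for illustration.

```python
from bisect import bisect_right

def to_density(edges, counts):
    """Convert bin counts to density units: count / (n * bin width)."""
    n = sum(counts)
    return [c / (n * (b - a)) for c, a, b in zip(counts, edges[:-1], edges[1:])]

def density_at(edges, dens, x):
    """Piecewise-constant density at x; zero outside the histogram's range."""
    if x < edges[0] or x >= edges[-1]:
        return 0.0
    return dens[bisect_right(edges, x) - 1]

def merge_histograms(edges1, counts1, edges2, counts2):
    """Mixture f = (n1/n) f1 + (n2/n) f2 on the union of both sets of bin edges."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    f1 = to_density(edges1, counts1)
    f2 = to_density(edges2, counts2)
    edges = sorted(set(edges1) | set(edges2))
    dens = []
    for a, b in zip(edges[:-1], edges[1:]):
        mid = 0.5 * (a + b)  # both densities are constant on (a, b)
        dens.append(n1 / n * density_at(edges1, f1, mid)
                    + n2 / n * density_at(edges2, f2, mid))
    return edges, dens

# Hypothetical example: n1 = 10 and n2 = 20 observations.
edges, dens = merge_histograms(
    [0.0, 1.0, 2.0, 3.0], [2, 5, 3],
    [0.5, 2.5, 4.5], [12, 8],
)
total_mass = sum(d * (b - a) for d, a, b in zip(dens, edges[:-1], edges[1:]))
```

Since each input density integrates to one and the weights sum to one, `total_mass` should come out as 1 up to floating-point error.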

kjetil b halvorsen
  • The combined histogram would have variable bin widths? Is there a way for the combined histogram to have uniform bin widths? The goal is to visualize the combined data distribution. – Snowfish Jul 28 '17 at 00:36
  • The method above would give variable bin width, yes. To get uniform bin width, you would have to introduce approximations. Why do you want that? – kjetil b halvorsen Jul 28 '17 at 11:51
  • I am not entirely sure but here are some of my thoughts: 1) Combining many histograms can produce many thin bins which might confuse people. 2) Your solution also assumes that the observations are uniformly distributed within a bin, right? 3) What if we estimate a smooth pdf from each histogram and combine the pdfs instead? – Snowfish Jul 28 '17 at 17:51
  • @Snowfish: If you have some idea of a parametric model for the data, you can estimate parameters in that model from the histogram counts by maximum likelihood. See https://stats.stackexchange.com/questions/444755/what-methods-are-there-for-estimating-distributions-based-on-histograms. – kjetil b halvorsen Feb 18 '21 at 16:06
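
One possible form of the approximation mentioned in the comments, for getting uniform bin widths, is to spread the probability mass of each variable-width bin over a uniform grid in proportion to the overlap. The sketch below does this in Python. It relies on the assumption raised in the comments, that observations are uniformly distributed within each bin. The function name `rebin_uniform`, the choice of `k` bins, and the example densities are hypothetical.

```python
def rebin_uniform(edges, dens, k):
    """Re-express a piecewise-constant density on k equal-width bins.

    Assumes mass is spread uniformly within each original bin, so the mass
    moved into a new bin is density * length of overlap.
    """
    lo, hi = edges[0], edges[-1]
    width = (hi - lo) / k
    new_edges = [lo + i * width for i in range(k)] + [hi]
    new_dens = []
    for a, b in zip(new_edges[:-1], new_edges[1:]):
        mass = 0.0
        for d, s, t in zip(dens, edges[:-1], edges[1:]):
            overlap = min(b, t) - max(a, s)
            if overlap > 0:
                mass += d * overlap
        new_dens.append(mass / (b - a))
    return new_edges, new_dens

# Hypothetical variable-width density that integrates to 1.
old_edges = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0, 4.5]
old_dens = [1 / 15, 4 / 15, 11 / 30, 3 / 10, 7 / 30, 2 / 15]

new_edges, new_dens = rebin_uniform(old_edges, old_dens, 3)
new_mass = sum(d * (b - a) for d, a, b in zip(new_dens, new_edges[:-1], new_edges[1:]))
```

Total probability is preserved, but detail finer than the new bin width is averaged away, and the result is only as good as the uniform-within-bin assumption.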